Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.
In recent years, with the rapid development of voice communication technology and users' increasing requirements on voice quality, methods based on supervised deep neural network learning have accelerated research on speech noise reduction. Speech noise reduction refers to separating a target speech signal from background noise so as to eliminate or suppress the background noise. As one mode, a large number of real target speech signals and noise signals can be randomly mixed and used as the input of a neural network; after supervised training, the neural network can automatically learn to output the target speech signals from the training samples, and the noise reduction effect is improved. However, as the sampling rate of the target speech increases, the amount of calculation of the neural network also increases, so that such neural networks cannot be widely applied.
For example, in the singing scenario, the target speech signal is a human singing voice, and the sampling rate of the audio signal is typically 44.1 kHz. Singing voice noise reduction is a special scene of voice noise reduction: when the traditional voice noise reduction technology is applied to such high-sampling-rate audio, the effect is not ideal; some existing voice noise reduction methods based on deep neural networks are also difficult to apply to singing voice noise reduction scenes because their network parameters are too large, which results in an excessive amount of calculation.
In view of the above problems, the inventors have found through long-term research that, for a section of audio signal, the non-stationarity of the signal input to the neural network model increases the amount of calculation of the neural network, and directly inputting the audio signal to a convolutional neural network model further increases the amount of calculation while the noise reduction effect is not obvious. In order to reduce the amount of calculation of the neural network and improve the noise reduction effect of the audio, the inventors found that the spectral energy of the audio can be converted from the linear domain to the Bark frequency domain, the representation of the spectral energy in the Bark domain can be used as a Bark feature, and the Bark feature can be used as the input of the neural network. In addition, the neural network adopts a brand new split-gate convolutional layer structure, so that the convolutional neural network can enlarge the learning visual field (receptive field) of the convolution kernel while limiting the increase of the amount of calculation, increase the degree of nonlinearity of the neural network model, and improve the noise reduction effect of the audio.
Therefore, the embodiments of the present application provide a method, an apparatus, an electronic device and a storage medium for audio noise reduction: the Bark feature of a first audio is input into a target neural network model obtained through pre-training, the amplitude parameter of the voice signal is then calculated, and the target voice is obtained based on the amplitude parameter, which reduces the amount of calculation of the neural network model and reduces the background noise in the voice information.
For the convenience of describing the scheme of the present application in detail, the following first describes a split-gate convolutional neural network model provided in the embodiments of the present application with reference to the drawings.
Please refer to fig. 1, which is a schematic diagram of the network structure of an exemplary split-gate convolutional layer in a voice noise reduction method according to an embodiment of the present application. The split-gate convolutional layer comprises four two-dimensional convolutional layers, a first activation function module and a second activation function module. In one aspect, the four two-dimensional convolutional layers include a first causal convolutional layer, a second causal convolutional layer, a third convolutional layer, and a fourth convolutional layer. The first activation function module is connected with the third convolutional layer, and the second activation function module is connected with the fourth convolutional layer. The convolution kernel size of the first causal convolutional layer may be kw × 1, the convolution kernel size of the second causal convolutional layer may be 1 × kh, the convolution kernel sizes of the third convolutional layer and the fourth convolutional layer may be the same, and the numbers of convolution kernel channels of the four two-dimensional convolutional layers are the same, that is, the number of channels (for example, c as shown in fig. 1) of the convolution kernels of the first causal convolutional layer, the second causal convolutional layer, the third convolutional layer and the fourth convolutional layer is the same. With the third convolutional layer connected to the first activation function module and the fourth convolutional layer connected to the second activation function module, the output of the first activation function module is multiplied with the output of the second activation function module, thereby obtaining the final output of the split-gate convolutional layer.
The first activation function module may adopt a Rectified Linear Unit (ReLU, a linear rectification function), and the second activation function module may adopt a Sigmoid function. Optionally, in practical implementation, the first activation function module and the second activation function module may also use other functions, which is not limited herein.
Optionally, in the embodiment of the present application, specific values of kw, kh, and c are not limited. As a way, by adjusting the three parameters, the split-gate convolutional layer can more effectively learn the input speech feature information, and further better identify the required target speech, or remove the background noise in the noisy audio.
Optionally, in one implementation, as shown in fig. 1, the four two-dimensional convolutional layers may be a causal convolutional layer (kw × 1, c), a causal convolutional layer (1 × kh, c), and two separated convolutional layers (1 × 1, c), respectively, where kw and kh are the convolution kernel sizes of the causal convolutional layers, and c is the number of convolution kernel channels of the causal convolutional layers. By separating the convolution kernel into two elongated convolution kernels as shown in fig. 1, the learning visual field (receptive field) of the convolutional layers can be enlarged, while the amount of computation of the convolutional layers can be reduced by using the separated (1 × 1, c) convolutional layers.
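For illustration only, the following is a minimal sketch in PyTorch (not part of the embodiment) of one possible implementation of the split-gate convolutional layer of fig. 1; the class name, the padding scheme, and the equal input and output channel counts are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SplitGateConv2d(nn.Module):
    """Sketch of the split-gate convolutional layer of fig. 1.

    Input/output shape: (batch, channels, time, frequency). Layer names,
    padding choices and channel handling are illustrative assumptions.
    """

    def __init__(self, channels: int, kw: int, kh: int):
        super().__init__()
        self.kw = kw
        # First causal convolutional layer: kw x 1 kernel along the time axis.
        self.causal_time = nn.Conv2d(channels, channels, kernel_size=(kw, 1))
        # Second causal convolutional layer: 1 x kh kernel along the frequency axis.
        self.causal_freq = nn.Conv2d(channels, channels, kernel_size=(1, kh),
                                     padding=(0, kh // 2))
        # Two separated 1 x 1 convolutional layers feeding the two activation branches.
        self.branch_relu = nn.Conv2d(channels, channels, kernel_size=1)
        self.branch_gate = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Left-pad (kw - 1) frames on the time axis so the kw x 1 convolution
        # only sees current and past frames (causal in time).
        x = F.pad(x, (0, 0, self.kw - 1, 0))
        x = self.causal_time(x)
        x = self.causal_freq(x)
        # ReLU branch multiplied by the sigmoid gate gives the final layer output.
        return torch.relu(self.branch_relu(x)) * torch.sigmoid(self.branch_gate(x))
```

In this sketch the first activation function module is a ReLU and the second is a Sigmoid, and the multiplication of the two branch outputs corresponds to the final output of the split-gate convolutional layer described above.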
In order to better understand the scheme of the present application, the Bark feature related to the embodiments of the present application is described below.
The Bark domain is a psychoacoustic measure of sound. Because of the particular structure of the cochlea of the human ear, the human auditory system produces a series of critical bands. A critical band is a band of sound frequencies within which a sound signal is easily masked, that is, the sound signal in the critical band is easily masked by another signal of large energy and close frequency, so that the human auditory system cannot perceive it. As a way of example, if the sound signal is divided into critical bands along the frequency dimension, each critical band becomes one Bark, thus converting the sound signal from the linear frequency domain into the Bark domain. Optionally, the embodiment of the present application converts the sound signal from the linear frequency domain to the Bark domain using the following formula:
Bark(f) = 13 × arctan(0.00076 × f) + 3.5 × arctan((f / 7500)²),
where arctan is the arctangent function, f is the linear frequency of the sound signal (in Hz), and Bark(f) is the Bark domain representation of the sound signal; the constants shown here are those of the commonly used Zwicker arctangent approximation.
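As a small illustration only (not part of the embodiment), the conversion above can be computed per frequency value in Python, assuming the Zwicker constants shown in the formula:

```python
import numpy as np

def hz_to_bark(f_hz):
    """Map linear frequency in Hz to the Bark scale (Zwicker approximation)."""
    f_hz = np.asarray(f_hz, dtype=np.float64)
    return 13.0 * np.arctan(0.00076 * f_hz) + 3.5 * np.arctan((f_hz / 7500.0) ** 2)

# Example: Bark values of the centre frequencies of 1025 STFT bins at 44.1 kHz.
bin_freqs = np.linspace(0.0, 44100.0 / 2.0, 1025)
bark_values = hz_to_bark(bin_freqs)
```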
Alternatively, after converting the sound signal from the linear frequency dimension to the Bark domain dimension, the spectral energy characteristic of the sound signal in the linear frequency dimension also needs to be converted into the Bark characteristic in the Bark domain dimension. As one way, the value obtained by performing a short-time Fourier transform on the audio (i.e. the sound signal) can be expressed as:
stft(t,f)=x(t,f)+i×y(t,f),
where stft(t, f) is the spectral feature represented in the time-frequency domain, expressed as the complex value x + yi in the equation, in which x represents the real part of the spectral feature and y represents its imaginary part.
Further, the linear spectral energy of the audio can be calculated by the following formula:
stft_energy(t, f) = x(t, f)² + y(t, f)²,
where stft_energy(t, f) is the linear spectral energy of the audio at frame t and frequency f.
the conversion of the linear spectral energy signature into Bark signature can then be expressed as:
Bark_feature=mat_mul(stft_energy,stft2bark_matrix),
where mat_mul represents matrix multiplication and stft2bark_matrix represents the conversion matrix for the Bark feature.
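For illustration, a minimal NumPy sketch of the two formulas above is given below; the array shapes and the way stft2bark_matrix is constructed (e.g. by grouping STFT bins into 48 Bark bands) are assumptions:

```python
import numpy as np

def bark_features(stft, stft2bark_matrix):
    """Compute Bark features from a complex STFT (sketch).

    stft: complex array of shape (frames, 1025) -- hypothetical layout.
    stft2bark_matrix: conversion matrix of shape (1025, 48).
    """
    x, y = stft.real, stft.imag
    stft_energy = x ** 2 + y ** 2                  # linear spectral energy per bin
    bark_feature = stft_energy @ stft2bark_matrix  # mat_mul(stft_energy, stft2bark_matrix)
    return bark_feature
```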
Optionally, after learning the Bark feature, the neural network outputs the ratio of the Bark value of the target voice (e.g. singing voice) to the Bark value of the noisy audio, i.e. bark_mask. Using the same principle, the conversion is carried out in the reverse direction to obtain the ratio mask between the spectral amplitude of the target voice and the spectral amplitude of the noisy audio in the linear frequency dimension, and the conversion formula is as follows:
mask=mat_mul(bark_mask,bark2stft_matrix),
where bark2stft_matrix is the inverse conversion matrix of the Bark feature.
As one way, a target neural network model including at least one split-gate convolutional layer and at least one long short-term memory layer is adopted in the embodiment of the application to remove the background noise signal in the noisy audio. The Bark feature of the audio is input into the target neural network model to obtain the denoised audio feature, namely the audio feature of the target voice. Optionally, the split-gate convolutional layer is configured to output a texture feature corresponding to the target speech signal according to the noisy audio feature, and the long short-term memory layer is configured to output the denoised audio feature, that is, the spectral feature (including the spectral amplitude and the spectral energy) of the target voice in the Bark domain, according to the texture feature. The long short-term memory layer is a Long Short-Term Memory network (LSTM). The following briefly describes the long short-term memory network used in the present application with reference to the drawings.
Please refer to fig. 2, which is a schematic structural diagram of a long short-term memory network suitable for the voice denoising method according to the embodiment of the present application. As shown in fig. 2, the LSTM includes three control gates: a forget gate, an input gate, and an output gate. The activation function σ in each gate is a sigmoid activation function. The sigmoid activation function processes the output h_{t-1} of the previous step and the current input X_t to determine which data in the cell state C_{t-1} of the previous moment is to be forgotten, according to the following formula:
f_t = σ(W_f · [h_{t-1}, X_t] + b_f),
where a value of f_t equal to 0 indicates complete forgetting and a value of 1 indicates complete acceptance.
Further, the sigmoid activation function determines which information to accept, while tanh generates new candidate values C̃_t. Combining the two, the cell state C_{t-1} can be updated as follows:
i_t = σ(W_i · [h_{t-1}, X_t] + b_i)
C̃_t = tanh(W_C · [h_{t-1}, X_t] + b_C)
C_t = f_t · C_{t-1} + i_t · C̃_t
Further, the output gate determines through the sigmoid activation function which part of the information is output, tanh generates a new output candidate value from the cell state, and finally the hidden layer value h_t is output:
o_t = σ(W_o · [h_{t-1}, X_t] + b_o)
h_t = o_t · tanh(C_t)
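For illustration, the gate equations above can be written as a single LSTM time step in NumPy; the weight names and shapes are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_f, b_f, W_i, b_i, W_c, b_c, W_o, b_o):
    """One LSTM step following the forget/input/output gate equations above."""
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, X_t]
    f_t = sigmoid(W_f @ z + b_f)             # forget gate
    i_t = sigmoid(W_i @ z + b_i)             # input gate
    c_cand = np.tanh(W_c @ z + b_c)          # new candidate values
    c_t = f_t * c_prev + i_t * c_cand        # updated cell state
    o_t = sigmoid(W_o @ z + b_o)             # output gate
    h_t = o_t * np.tanh(c_t)                 # new hidden layer value
    return h_t, c_t
```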
Alternatively, the LSTM may comprise a plurality of layers as shown in fig. 2, each layer accepting the hidden layer output and state vector of the previous layer together with the currently input data as input, and updating the hidden layer output and state vector passed to the next layer, so that past key information can be stored and used to predict future information.
Embodiments of the present application will be described in detail below with reference to the accompanying drawings.
Referring to fig. 3, a flowchart of a voice noise reduction method according to an embodiment of the present application is shown. The method can be applied to the electronic device and includes:
step S110: the method comprises the steps of obtaining a first audio frequency, wherein the first audio frequency is an audio frequency mixed with a voice signal and a background noise signal.
The first audio may be an audio in which a speech signal of a target sampling rate and a background noise signal are mixed. Optionally, the target sampling rate may be a high sampling rate, for example, 44.1 kHz or 48 kHz, or may be a non-high sampling rate, for example, 11.025 kHz, 22.05 kHz, 24 kHz, and the like; the specific value of the target sampling rate is not limited in this embodiment. The speech signal represents a clean speech signal or a sound signal containing only a small amount of noise. As one approach, the voice signal may be derived from a piece of audio, such as a piece of singing voice, a piece of recorded voice, etc.; alternatively, the voice signal may be derived from video, that is, the voice signal may be a sound signal intercepted from a video. The source of the voice signal is not specifically limited.
Alternatively, the first audio in the embodiment of the present application may be singing voice (the sampling rate is typically 44.1 kHz).
The background noise signal refers to a sound signal that interferes with the speech signal; it may be derived from ambient sound or electromagnetic interference, and background noise may cause the performance of many speech processing systems to degrade drastically, which greatly affects user experience. It can be understood that the first audio may inevitably contain a background noise signal; in order to reduce the influence of the background noise signal on the speech signal and improve the user experience, this embodiment may acquire the first audio and perform corresponding processing on it to reduce the background noise signal of the first audio.
Optionally, in order to improve the speech system function of the electronic device, the electronic device may monitor the audio signal in real time, and in this case, the electronic device may recognize any piece of audio (including audio data in the video) as the first audio, so that the background noise of the first audio may be reduced in real time.
The electronic device may acquire the first audio in a plurality of ways.
As one mode, the electronic device may obtain, through the audio system program, audio data of a third-party client program including the audio data, and further obtain the first audio. For example, the game audio generated by the game application in the running process, the singing audio generated by the singing application in the running process, the video playing sound effect generated by the video playing application in the running process, or the starting audio generated by the electronic device in the starting process are acquired through the audio system program, and optionally, the audio can be used as the first audio, so that the first audio is acquired.
Alternatively, the electronic device may obtain audio data as the first audio from the internet in real time, for example, dubbing an advertisement of a certain website as the first audio. Optionally, the electronic device may also use the remotely downloaded audio data as the first audio, or record a segment of the user's voice as the first audio. The source and format of the first audio are not limited, and are not listed here.
Step S120: and preprocessing the first audio to obtain Bark characteristics of the first audio.
As one way, the preprocessing in the embodiment of the present application may refer to converting the first audio from the linear time domain dimension to the frequency domain dimension for processing. Specifically, the spectral characteristic of the first audio is converted from the linear time domain dimension to the Bark frequency domain, so as to obtain the Bark characteristic of the first audio. The spectral characteristics include the spectral energy characteristic and the spectral amplitude characteristic of the first audio; optionally, the value of the spectral energy characteristic is equal to the square of the value of the spectral amplitude characteristic. It will thus be understood that the Bark characteristic of the first audio can be understood as the representation of the spectral energy characteristic of the first audio in the Bark frequency domain.
It can be understood that the speech signal of the first audio is a non-stationary speech signal, and the degree of linearity of the Bark feature can be reduced by preprocessing the first audio, so that the background noise signal of the first audio can be more effectively removed after the Bark feature is input into the pre-trained target neural network model.
Step S130: inputting the Bark characteristics into a target neural network model obtained by pre-training, and acquiring Bark characteristic proportion parameters output by the target neural network model.
From the foregoing, the target neural network model in the embodiment of the present application includes at least one split gate convolution layer and at least one long-short term memory layer. The specific number and arrangement order of the split gate convolution layers and the long and short term memory layers are not limited in the embodiments of the present application, and may be set according to actual situations.
For example, as one approach, the target neural network model may include 3 split-gate convolutional layers and two long-short term memory layers. In this case, in order to obtain a better noise reduction effect, this implementation designs a loss function and adopts the adaptive moment estimation method (Adam); the loss function can reduce distortion of the amplitude of the speech signal of the first audio after the Bark feature is input to the target neural network model. With the network structure comprising the 3 split-gate convolutional layers and the two long-short term memory layers, combined with the adaptive moment estimation method, the target neural network model can learn the newly acquired first audio and thus reduce the background noise signal in the first audio well. Specifically, the adaptive moment estimation method according to the embodiment of the present application uses a momentum factor BETA1 of 0.9, a momentum factor BETA2 of 0.999, and a basic LEARNING RATE (LEARNING_RATE) of 0.001, with the learning rate decaying to 0.3 times its value every 300,000 iterations. In this embodiment, the training BATCH SIZE (BATCH_SIZE) is set to 32, i.e. 32 training audios are input during one iteration of network training, and samples can be drawn repeatedly. Finally, the network is trained around 1,000,000 times, so that the loss converges to near its minimum.
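For illustration only, the training configuration described above might look as follows in PyTorch; the model, data loader and loss function are placeholders and not specified by the embodiment:

```python
import torch

def train_target_model(model, train_loader, loss_fn, total_steps=1_000_000):
    """Train the noise reduction network with the Adam settings described above (sketch)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))
    # Learning rate decays to 0.3 times its value every 300,000 iterations.
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=300_000, gamma=0.3)
    step = 0
    while step < total_steps:
        for bark_feature, target_mask in train_loader:  # batches of 32 training audios
            loss = loss_fn(model(bark_feature), target_mask)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            scheduler.step()
            step += 1
            if step >= total_steps:
                break
    return model
```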
Optionally, the Bark characteristic proportion parameter may represent the proportion of the spectral amplitude characteristic of the speech signal in the Bark frequency domain, that is, the proportion of the spectral amplitude characteristic of the speech signal within the spectral amplitude characteristic of the first audio (which includes the speech signal and the background noise signal). As one mode, the Bark feature of the first audio is input into the split-gate convolutional layer, and the output of the split-gate convolutional layer is then input into the long and short term memory layer, so that the Bark feature proportion parameter can be output (the target neural network model can automatically learn the Bark feature proportion parameter of the voice signal).
For example, in a specific application scenario, assuming that a section of noisy audio (i.e. a first audio) has a spectral amplitude of 1, and the entire audio is composed of speech (i.e. a speech signal) with a spectral amplitude of 0.8 and noise (i.e. a background noise signal) with a spectral amplitude of 0.2, the Bark feature corresponding to the noisy audio is input to the target neural network model, and the target neural network model can output a speech signal with a spectral amplitude of 0.8, i.e. the target neural network model can "pick" the speech signal from noisy Bark features, so as to obtain a Bark feature proportion parameter (here, 0.8) of the speech signal.
Step S140: and calculating the amplitude parameter of the voice signal based on the Bark characteristic proportion parameter.
Further, the amplitude parameter of the speech signal can be calculated based on the Bark characteristic proportion parameter obtained above.
The amplitude parameter represents a spectral amplitude parameter of the speech signal. Specifically, it includes the spectral amplitude ratio of the speech signal in the linear time domain, the spectral amplitude, and the spectral amplitude ratio in the Bark frequency domain. By calculating the amplitude parameter of the voice signal, the denoised voice signal can be converted from the Bark frequency domain back to the linear time domain, thereby restoring the waveform of the voice signal in the linear time domain so that the voice signal can be output.
Step S150: and acquiring target voice based on the amplitude parameter.
The target voice refers to a voice signal obtained by reducing noise of the first audio. Optionally, after the amplitude parameter is calculated based on the Bark characteristic scale parameter, the target voice may be obtained based on the amplitude parameter, that is, the voice signal obtained by removing the noise of the first audio is obtained.
In the voice noise reduction method provided by this embodiment, the Bark feature of a first audio in which a voice signal and a background noise signal are mixed is input into a target neural network model obtained through pre-training; the Bark feature representing the voice signal is selected so as to obtain a Bark feature proportion parameter that represents the proportion of the spectral amplitude feature of the voice signal in the Bark frequency domain; the amplitude parameter of the voice signal is then calculated based on the Bark feature proportion parameter, and the target voice (i.e., the voice signal after the background noise signal in the first audio is eliminated) is obtained based on the amplitude parameter. This achieves the purpose of noise reduction, and, by directly judging and screening the voice signal in the first audio through the target neural network model, reduces the amount of calculation of the neural network model.
Referring to fig. 4, a flowchart of a method for reducing noise of a voice according to another embodiment of the present application is shown, where the embodiment provides a method for reducing noise of a voice, which can be applied to the electronic device, and the method includes:
step S210: a training sample set is obtained.
It should be noted that, in the embodiment of the present application, a target neural network model capable of recognizing a speech signal and thereby reducing noise may be trained in advance through an acquired training sample set, and the noise signal in a noisy audio can be better filtered out through the model to obtain the speech signal.
The training sample set of the embodiment of the application comprises a voice signal with preset duration and a background noise signal. Optionally, the preset duration may be any continuous or discontinuous duration, and the preset duration of the speech signal and the preset duration of the background noise signal may be equal or unequal. For example, the preset duration of the voice signal may be 20 hours, and the preset duration of the background noise signal may be 10 hours; or the preset duration of the voice signal and the preset duration of the background noise signal are both 15 hours, and the like, which is not limited specifically.
Optionally, the target singing voice with different timbres in the continuous preset time length may be used as the voice signal with the preset time length, or the target singing voice with different timbres in the discontinuous preset time length (that is, the preset time length is discontinuous) may be used as the voice signal with the preset time length.
Similarly, different types of background noise within a continuous preset time period may be used as the background noise signal of the preset time period, or different types of background noise within a discontinuous preset time period may be used as the background noise signal of the preset time period.
As one mode, the preset duration may be obtained according to a preset obtaining mode, for example, the preset duration is obtained according to an integral multiple of an hour; optionally, the voice signal and the background noise signal may also be randomly acquired, and the acquisition durations of the voice signal and the background noise signal which are respectively acquired are used as respective preset durations.
In one implementation, the electronic device may obtain audio with time sequence selected by a user as a voice signal and a background noise signal of a preset duration; audio data can also be randomly captured from the network as a voice signal with preset duration and a background noise signal; or the audio data in the running process of the audio application program of the electronic equipment is used as a voice signal with preset duration and a background noise signal. It should be noted that the obtaining manner and the obtained content source of the voice signal and the background noise signal with the preset duration are not limited, and may be selected according to the actual situation.
Step S220: and superposing the voice signal and the background noise signal on a linear time domain according to a preset signal-to-noise ratio, inputting the superposed training sample set into a machine learning model, and training the machine learning model to obtain a target neural network model.
It is understood that any piece of speech data that has not undergone any noise reduction processing cannot avoid the presence of background noise, i.e., there is always a certain signal-to-noise ratio. The signal-to-noise ratio (SNR) refers to the ratio of signal to noise in an electronic device or system. It can be understood that, in order to increase the noise reduction accuracy of the target neural network model in the embodiment of the present application, so that the noise reduction algorithm adapts to audio data with different signal-to-noise ratios, a speech signal of a preset duration and a background noise signal may be superimposed on the linear time domain according to a preset signal-to-noise ratio, and the superimposed training sample set is input to the machine learning model to train it, thereby obtaining the target neural network model.
Optionally, the preset signal-to-noise ratio may be a random number between 0 and 20, and the specific numerical value is not limited.
The machine learning model can be a linear model, a kernel method and a support vector machine, a decision tree and Boosting, a neural network (including a fully-connected neural network, a convolutional neural network, a cyclic neural network, etc.), and the like. The specific training mode of each machine learning model may refer to respective working principles in the prior art, and is not described herein again.
It should be noted that, when the speech signal and the background noise signal are superimposed on the linear time domain according to the preset signal-to-noise ratio, the preset duration of the speech signal and the background noise signal in the training sample set is equal, for example, 2.5 seconds of speech signal and 2.5 seconds of background noise signal are selected from the training sample set, so that the neural network model obtained by training can adapt to noisy audio with more signal-to-noise ratios.
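For illustration, the superposition of a speech clip and a noise clip of equal length at a preset signal-to-noise ratio could be sketched as follows; the assumption that the SNR is expressed in decibels and the random placeholder data are not taken from the embodiment:

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Superimpose speech and noise of equal length at the given SNR in dB (sketch)."""
    assert len(speech) == len(noise)
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale the noise so that 10*log10(speech_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise

# Example: 2.5 s clips at 44.1 kHz mixed at a random SNR between 0 and 20.
rng = np.random.default_rng(0)
speech = rng.standard_normal(int(2.5 * 44100))  # placeholder for a real singing clip
noise = rng.standard_normal(int(2.5 * 44100))   # placeholder for a real noise clip
noisy = mix_at_snr(speech, noise, snr_db=rng.uniform(0, 20))
```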
Step S230: a first audio is acquired.
For a specific description of obtaining the first audio, reference may be made to the description of step S110 in the foregoing embodiment, and details are not described herein again.
Step S240: the first audio signal is frame windowed.
Since the first audio signal is a non-stationary signal, it needs to be framed and windowed. As one mode, a Hanning window is adopted in the embodiment of the present application, with the window length set to 40 ms (milliseconds) and the sliding window set to 10 ms. The window function adopted is not particularly limited in this embodiment and may be another window function, such as a triangular window function.
In a specific application scenario, if the audio sampling rate of the speech signal is 44.1 kHz, the window length of the Hanning window is 1764 audio sample points and the sliding window is 441 audio sample points. Optionally, setting the window length in this way can increase the overall operation speed of the target neural network model on the premise of not distorting the voice signal. By framing and windowing the first audio signal, inter-frame discontinuities can be avoided.
Step S250: and carrying out short-time Fourier transform on the first audio signal in each window to obtain Bark characteristics of the first audio.
Optionally, the first audio signal in each window is subjected to a short-time Fourier transform to convert the spectral energy characteristic of the first audio from the linear time domain to the Bark frequency domain, thereby obtaining the Bark characteristic of the first audio. Specifically, in the embodiment of the present application, the number of points of the short-time Fourier transform is set to 2048, so that 1025 frequency-dimension values (i.e. stft values) are obtained after the short-time Fourier transform. The dimension of the Bark feature taken in this embodiment is 48, so the dimension of the conversion matrix stft2bark_matrix from stft_energy to the Bark feature is 1025 × 48.
It should be noted that, while performing the short-time Fourier transform on the signal in each window, the phase value of the first audio is also calculated, for example as:
stft_phase(t, f) = arctan(y(t, f) / x(t, f)),
where arctan is the arctangent function and x and y are the real part and the imaginary part of the short-time Fourier transform, respectively.
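For illustration, the framing, windowing, short-time Fourier transform and phase extraction described above might be sketched as follows; the use of scipy and the array layout are assumptions:

```python
import numpy as np
from scipy.signal import stft

FS = 44100
WIN_LEN = int(0.040 * FS)   # 40 ms Hanning window -> 1764 sample points
HOP = int(0.010 * FS)       # 10 ms sliding window  -> 441 sample points

def analyse_first_audio(first_audio):
    """Return per-frame spectral energy and phase of the first audio (sketch)."""
    _, _, spec = stft(first_audio, fs=FS, window='hann', nperseg=WIN_LEN,
                      noverlap=WIN_LEN - HOP, nfft=2048)  # 2048-point FFT -> 1025 bins
    spec = spec.T                                  # shape (frames, 1025)
    stft_energy = spec.real ** 2 + spec.imag ** 2  # linear spectral energy
    stft_phase = np.angle(spec)                    # phase value of the first audio
    return stft_energy, stft_phase
```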
Step S260: and inputting the Bark characteristics into a target neural network model obtained by pre-training to obtain the spectrum amplitude proportion of the voice signal in the Bark frequency domain.
By way of example, as previously described, the target neural network model in the embodiments of the present application may include three split-gate convolutional layers and two long-short term memory layers. When the Bark feature is input into the target neural network model obtained by pre-training, the Bark feature is first input into the split-gate convolutional layers, the output of the split-gate convolutional layers is then input into the long-short term memory layers, and the spectral amplitude ratio (bark_mask) of the voice signal in the Bark frequency domain is output.
After the Bark feature is input into the split-gate convolutional layer, the processing steps of each split-gate convolutional layer may include: inputting the input data (for the first split-gate convolutional layer, the input data is the Bark feature) into the first causal convolutional layer; inputting the output of the first causal convolutional layer into the second causal convolutional layer; inputting the output of the second causal convolutional layer into the third convolutional layer and the fourth convolutional layer respectively; inputting the output of the third convolutional layer into the first activation function module and the output of the fourth convolutional layer into the second activation function module; and multiplying the output of the first activation function module by the output of the second activation function module to obtain the output of the split-gate convolutional layer.
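For illustration, three split-gate convolutional layers and two LSTM layers could be stacked into the target network roughly as follows, reusing the SplitGateConv2d sketch given earlier; the channel counts, kernel sizes, hidden size and the final sigmoid are assumptions:

```python
import torch
import torch.nn as nn
# SplitGateConv2d is the sketch class defined earlier in this description.

class TargetDenoiseNet(nn.Module):
    """Sketch: 3 split-gate convolutional layers followed by 2 LSTM layers.

    Expects Bark features of shape (batch, frames, 48) and outputs a
    bark_mask of the same shape; all dimensions are illustrative.
    """

    def __init__(self, bark_dim: int = 48, channels: int = 16, hidden: int = 128):
        super().__init__()
        self.proj = nn.Conv2d(1, channels, kernel_size=1)
        self.split_gate = nn.Sequential(SplitGateConv2d(channels, kw=3, kh=3),
                                        SplitGateConv2d(channels, kw=3, kh=3),
                                        SplitGateConv2d(channels, kw=3, kh=3))
        self.lstm = nn.LSTM(channels * bark_dim, hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, bark_dim)

    def forward(self, bark_feature: torch.Tensor) -> torch.Tensor:
        x = bark_feature.unsqueeze(1)            # (batch, 1, frames, bark_dim)
        x = self.split_gate(self.proj(x))        # texture features from split-gate layers
        b, c, t, f = x.shape
        x, _ = self.lstm(x.permute(0, 2, 1, 3).reshape(b, t, c * f))
        return torch.sigmoid(self.out(x))        # bark_mask per frame and Bark band
```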
Step S270: and carrying out Bark characteristic inverse conversion on the Bark frequency domain spectrum amplitude ratio to obtain the spectrum amplitude ratio of the voice signal in a linear time domain.
As one approach, the following can be expressed by the formula:
mask=mat_mul(bark_mask,bark2stft_matrix),
and carrying out Bark characteristic inverse transformation on the Bark frequency domain spectrum amplitude ratio (Bark _ mask), wherein the dimension of a Bark characteristic inverse transformation matrix is 25 x 1025, and obtaining the spectrum amplitude ratio (mask) of the speech signal in a linear time domain. The sound wave of the voice signal can be conveniently synthesized by converting the frequency spectrum amplitude ratio to the linear time domain, and then the voice signal effect after noise reduction is checked.
Step S280: and calculating the spectral amplitude of the voice signal based on the spectral amplitude proportion of the linear time domain and the spectral energy of the first audio in the linear time domain.
The spectral energy of the first audio in the linear time domain can be calculated by the following formula:
stft_energy(t, f) = x(t, f)² + y(t, f)²,
As one mode, the spectral amplitude of the speech signal can then be calculated using the spectral amplitude ratio (mask) of the speech signal in the linear time domain and the spectral energy stft_energy of the first audio in the linear time domain, for example:
stft_mag = mask × sqrt(stft_energy),
where stft_mag is the spectral amplitude of the speech signal.
step S290: and acquiring target voice based on the spectrum amplitude value and the phase value of the first audio.
As one mode, the phase value and the spectral amplitude may be subjected to inverse fourier transform to obtain the target speech.
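For illustration, steps S270 to S290 could be sketched together as follows; bark2stft_matrix is assumed to be available, the amplitude formula follows the mask-times-amplitude relation described above, and the scipy istft parameters mirror the analysis sketch given earlier:

```python
import numpy as np
from scipy.signal import istft

def reconstruct_target_speech(bark_mask, bark2stft_matrix, stft_energy, stft_phase,
                              fs=44100, win_len=1764, hop=441):
    """Resynthesise the target speech waveform from the predicted bark_mask (sketch)."""
    mask = bark_mask @ bark2stft_matrix         # spectral amplitude ratio per linear bin
    stft_mag = mask * np.sqrt(stft_energy)      # spectral amplitude of the speech signal
    spec = stft_mag * np.exp(1j * stft_phase)   # recombine amplitude and phase
    _, target_speech = istft(spec.T, fs=fs, window='hann', nperseg=win_len,
                             noverlap=win_len - hop, nfft=2048)
    return target_speech
```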
The present embodiment is exemplarily described below by taking fig. 5 as an example:
Fig. 5 is a schematic flow chart of a method for reducing noise in a noisy song according to an embodiment of the present application. In this example, the speech signal is a singing voice signal and the first audio is a noisy singing voice. A short-time Fourier transform is performed on the noisy singing voice to obtain its spectral energy (stft_energy) in the linear time domain, and the Bark feature conversion is applied to this spectral energy to obtain the Bark feature (bark_feature) of the noisy singing voice in the Bark frequency domain. The Bark feature is input into the pre-trained neural network model, which comprises 3 split-gate convolutional layers and two long and short term memory layers, and the spectral amplitude ratio (bark_mask) of the singing voice signal in the Bark domain is output. The spectral amplitude ratio (bark_mask) is then subjected to the inverse feature conversion to obtain the spectral amplitude ratio (mask) of the singing voice signal in the linear time domain, and the spectral amplitude (stft_mag) of the singing voice signal is calculated based on the spectral amplitude ratio (mask) and the spectral energy (stft_energy) of the noisy singing voice in the linear time domain, which was computed when the short-time Fourier transform was applied to the noisy singing voice.
It should be noted that the short-time Fourier transform of the noisy singing voice also yields the phase (stft_phase) of the noisy singing voice in the linear time domain, so that the waveform of the denoised singing voice signal in the linear time domain can be synthesized from the spectral amplitude (stft_mag) of the singing voice signal and the phase (stft_phase) of the noisy singing voice in the linear time domain, thereby obtaining the singing voice signal, such as the target singing voice shown in fig. 5, in which the background noise is obviously reduced compared with the noisy singing voice.
In the voice noise reduction method provided by this embodiment, a training sample set is obtained; the speech signal and the background noise signal in the training sample set are superimposed on the linear time domain according to a preset signal-to-noise ratio; the superimposed training sample set is input into a machine learning model and the machine learning model is trained to obtain the target neural network model. A first audio is then acquired, framed and windowed; a short-time Fourier transform is performed on the first audio signal in each window to obtain the Bark feature of the first audio; the Bark feature is input into the pre-trained target neural network model to obtain the spectral amplitude ratio of the speech signal in the Bark frequency domain; the inverse Bark feature conversion is applied to the Bark frequency domain spectral amplitude ratio to obtain the spectral amplitude ratio of the speech signal in the linear time domain; the spectral amplitude of the speech signal is calculated based on the spectral amplitude ratio in the linear time domain and the spectral energy of the first audio in the linear time domain; and finally, the target voice is obtained based on the spectral amplitude value and the phase value of the first audio. The input Bark feature is processed through a brand-new split-gate convolution structure, so that the noise reduction effect is guaranteed, the amount of calculation and the complexity of the neural network are greatly reduced, and the user experience is improved.
Referring to fig. 6, which is a block diagram of a voice denoising apparatus according to an embodiment of the present disclosure, in this embodiment, a voice denoising apparatus 300 is provided, which is operated in an electronic device, where the apparatus 300 includes: the first obtaining module 310, the preprocessing module 320, the first calculating module 330, the second calculating module 340, and the second obtaining module 350:
the first obtaining module 310 is configured to obtain a first audio, where the first audio is an audio mixed with a speech signal and a background noise signal.
As an approach, the apparatus 300 may further include a sample set obtaining unit and a model obtaining unit, where the sample set obtaining unit may be configured to obtain a training sample set, and the training sample set may include a speech signal with a preset duration and a background noise signal, and the speech signal and the background noise signal are superimposed on a linear time domain according to a preset signal-to-noise ratio. The model acquisition unit is used for inputting the training sample set into the machine learning model and training the machine learning model to obtain the target neural network model.
A preprocessing module 320, configured to preprocess the first audio to convert the spectral energy characteristic of the first audio from a linear time domain to a Bark frequency domain, so as to obtain a Bark characteristic of the first audio.
By one approach, the pre-processing module 320 may include a first processing unit and a second processing unit. The first processing unit may be configured to frame and window the first audio signal; the second processing unit may be configured to perform a short-time fourier transform on the first audio signal in each window to convert the spectral energy characteristic of the first audio from a linear time domain to a Bark frequency domain, resulting in a Bark characteristic of the first audio.
Optionally, the preprocessing module 320 may include a calculation unit for calculating a phase value of the first audio.
A first calculating module 330, configured to input the Bark feature into a pre-trained target neural network model, and obtain a Bark feature proportion parameter output by the target neural network model, where the Bark feature proportion parameter represents a proportion of a spectral amplitude feature of the speech signal in the Bark frequency domain.
As one mode, the first calculating module 330 may specifically be configured to input the Bark feature into a pre-trained target neural network model, so as to obtain a spectral amplitude ratio of the speech signal in the Bark frequency domain.
Optionally, the target neural network model in the embodiment of the present application may include three split gate convolution layers and two long-short term memory layers.
A second calculating module 340, configured to calculate an amplitude parameter of the speech signal based on the Bark feature scale parameter.
By one approach, the second calculation module 340 may include a first calculation unit and a second calculation unit. The first calculating unit can be used for carrying out Bark characteristic inverse conversion on the Bark frequency domain spectrum amplitude proportion to obtain the spectrum amplitude proportion of the voice signal in the linear time domain; the second calculation unit may be configured to calculate a spectral amplitude of the speech signal based on a spectral amplitude ratio of the linear time domain and a spectral energy of the first audio in the linear time domain.
A second obtaining module 350, configured to obtain the target speech based on the magnitude parameter.
By one approach, the second obtaining module 350 may obtain the target speech based on the spectral magnitude and phase values. Specifically, the second obtaining module 350 may perform inverse fourier transform on the phase value and the spectrum amplitude value to obtain the target voice.
According to the device for reducing the noise of the voice, the first audio is obtained, and the first audio is the audio mixed with the voice signal and the background noise signal; then preprocessing the first audio to convert the spectral energy characteristic of the first audio from a linear time domain to a Bark frequency domain to obtain a Bark characteristic of the first audio; inputting the Bark characteristics into a target neural network model obtained by pre-training to obtain Bark characteristic proportion parameters output by the target neural network model, wherein the Bark characteristic proportion parameters represent the proportion of the frequency spectrum amplitude characteristics of the voice signals in a Bark frequency domain; calculating the amplitude parameter of the voice signal based on the Bark characteristic proportion parameter; the target speech is then obtained based on the magnitude parameter. According to the method, the Bark characteristics of the first audio are input into the target neural network model obtained through pre-training, the amplitude parameter of the voice signal is further calculated, the target voice is obtained based on the amplitude parameter, the calculated amount of the neural network model is reduced, and the background noise in the voice information is reduced.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described apparatuses and modules may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, the coupling or direct coupling or communication connection between the modules shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or modules may be in an electrical, mechanical or other form.
In addition, functional modules in the embodiments of the present application may be integrated into one processing module, or each of the modules may exist alone physically, or two or more modules are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode.
Referring to fig. 7, based on the above method and apparatus for voice noise reduction, an embodiment of the present application further provides an electronic device 12 capable of executing the voice noise reduction method. The electronic device 12 includes a memory 122 and one or more processors 124 (only one is shown) that are communicatively coupled to each other. The memory 122 stores a program that can execute the contents of the foregoing embodiments, and the processor 124 can execute the program stored in the memory 122.
Processor 124 may include one or more processing cores. Processor 124 interfaces with various components throughout the electronic device 12 using various interfaces and circuitry, and performs various functions of the electronic device 12 and processes data by running or executing instructions, programs, code sets, or instruction sets stored in the memory 122 and by invoking data stored in the memory 122. Alternatively, the processor 124 may be implemented in hardware using at least one of a Digital Signal Processor (DSP), a Field-Programmable Gate Array (FPGA), and a Programmable Logic Array (PLA). The processor 124 may integrate one or more of a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a modem, and the like. The CPU mainly handles the operating system, user interface, application programs and the like; the GPU is used for rendering and drawing display content; and the modem is used to handle wireless communication. It is understood that the modem may also not be integrated into the processor 124 but be implemented by a separate communication chip.
The memory 122 may include a Random Access Memory (RAM) or a Read-Only Memory (ROM). The memory 122 may be used to store instructions, programs, code sets, or instruction sets. The memory 122 may include a program storage area and a data storage area, wherein the program storage area may store instructions for implementing an operating system, instructions for implementing at least one function (such as a touch function, a sound playing function, an image playing function, etc.), instructions for implementing the foregoing embodiments, and the like. The data storage area may store data created during use of the electronic device 12 (e.g., phone books, audio and video data, chat log data), and the like.
Referring to fig. 8, a block diagram of a computer-readable storage medium according to an embodiment of the present application is shown. The computer-readable storage medium 400 has stored therein program code that can be called by a processor to execute the methods described in the above-described method embodiments.
The computer-readable storage medium 400 may be an electronic memory such as a flash memory, an EEPROM (electrically erasable programmable read only memory), an EPROM, a hard disk, or a ROM. Optionally, the computer-readable storage medium 400 includes a non-transitory computer-readable storage medium. The computer readable storage medium 400 has storage space for program code 410 for performing any of the method steps of the method described above. The program code can be read from or written to one or more computer program products. Program code 410 may be compressed, for example, in a suitable form.
In the description herein, reference to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not necessarily depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.