CN113299302A

CN113299302A - Audio noise reduction method and device and electronic equipment

Info

Publication number: CN113299302A
Application number: CN202110436802.7A
Authority: CN
Inventors: 王少华; 杨闳博
Original assignee: Vivo Mobile Communication Hangzhou Co Ltd
Current assignee: Vivo Mobile Communication Hangzhou Co Ltd
Priority date: 2021-04-22
Filing date: 2021-04-22
Publication date: 2021-08-24

Abstract

The application discloses an audio noise reduction method and device and electronic equipment, and belongs to the technical field of communication. The method comprises the following steps: acquiring a first audio signal, wherein the first audio signal comprises a voice signal and a noise signal; performing pre-noise reduction processing on the first audio signal to obtain a second audio signal, wherein the signal-to-noise ratio of the second audio signal is greater than that of the first audio signal; inputting the second audio signal into a target deep learning network model to obtain an ideal mask of the second audio signal, wherein the target deep learning network model is obtained by training on a third audio signal and the ideal mask of the third audio signal, and the third audio signal is a signal obtained from the second audio signal; and performing noise reduction processing on the second audio signal according to the ideal mask of the second audio signal to obtain a target audio signal.

Description

Audio noise reduction method and device and electronic equipment

Technical Field

The embodiment of the application relates to the technical field of communication, in particular to an audio noise reduction method and device and electronic equipment.

Background

With the development of electronic technology, electronic devices have multiple functions of receiving and transmitting audio signals, playing audio signals, and the like, and performing noise reduction processing on audio signals becomes a common processing means in order to ensure the quality of audio signals.

Currently, digital filtering, a common signal processing technique (e.g., adaptive filtering, wavelet transform filtering, etc.), can be used to denoise audio signals. Digital filtering mainly uses differences in frequency spectrum characteristics to suppress interference waves so as to highlight effective waves. However, in practical use, the signal-to-noise ratio of the noisy speech signal collected by the microphone is uncertain, that is, it is sometimes high and sometimes low. Thus, when the digital filtering technique is used on a signal with a low signal-to-noise ratio, the noise spectrum characteristic and the speech spectrum characteristic in the noisy speech signal are hard to tell apart, which may result in a poor noise reduction effect on the audio signal.

Disclosure of Invention

The embodiment of the application aims to provide an audio noise reduction method, an audio noise reduction device and electronic equipment, and the problem that the noise reduction effect of an audio signal is poor can be solved.

In a first aspect, an embodiment of the present application provides an audio noise reduction method, where the method includes: acquiring a first audio signal, wherein the first audio signal comprises a voice signal and a noise signal; performing pre-noise reduction processing on the first audio signal to obtain a second audio signal, wherein the signal-to-noise ratio of the second audio signal is greater than that of the first audio signal; inputting the second audio signal into a target deep learning network model to obtain an ideal mask of the second audio signal, wherein the target deep learning network model is obtained by training on a third audio signal and the ideal mask of the third audio signal, and the third audio signal is a signal obtained from the second audio signal; and performing noise reduction processing on the second audio signal according to the ideal mask of the second audio signal to obtain a target audio signal.

In a second aspect, an embodiment of the present application provides an audio noise reduction apparatus, including: an acquisition module, a first noise reduction module, a processing module and a second noise reduction module. The acquisition module is used for acquiring a first audio signal, wherein the first audio signal comprises a voice signal and a noise signal; the first noise reduction module is used for performing pre-noise reduction processing on the first audio signal acquired by the acquisition module to obtain a second audio signal, wherein the signal-to-noise ratio of the second audio signal is greater than that of the first audio signal; the processing module is used for inputting the second audio signal into a target deep learning network model to obtain an ideal mask of the second audio signal, wherein the target deep learning network model is obtained by training on a third audio signal and the ideal mask of the third audio signal, and the third audio signal is a signal obtained from the second audio signal; and the second noise reduction module is used for performing noise reduction processing on the second audio signal according to the ideal mask of the second audio signal to obtain a target audio signal.

In a third aspect, embodiments of the present application provide an electronic device, which includes a processor, a memory, and a program or instructions stored on the memory and executable on the processor, where the program or instructions, when executed by the processor, implement the steps of the method according to the first aspect.

In a fourth aspect, embodiments of the present application provide a readable storage medium on which a program or instructions are stored, which when executed by a processor, implement the steps of the method according to the first aspect.

In a fifth aspect, embodiments of the present application provide a chip, where the chip includes a processor and a communication interface, and the communication interface is coupled to the processor, and the processor is configured to execute a program or instructions to implement the method according to the first aspect.

In the embodiment of the application, a first audio signal is obtained, wherein the first audio signal comprises a voice signal and a noise signal; pre-noise reduction processing is performed on the first audio signal to obtain a second audio signal, wherein the signal-to-noise ratio of the second audio signal is greater than that of the first audio signal; the second audio signal is input into a target deep learning network model to obtain an ideal mask of the second audio signal, wherein the target deep learning network model is obtained by training on a third audio signal and the ideal mask of the third audio signal, and the third audio signal is a signal obtained from the second audio signal; and noise reduction processing is performed on the second audio signal according to the ideal mask of the second audio signal to obtain a target audio signal. According to the method, on one hand, the audio signal is subjected to pre-noise reduction to improve its signal-to-noise ratio, so that the characteristics of the voice signal in the audio signal are more prominent and the accuracy of the trained target deep learning network model is improved; on the other hand, the ideal mask of the second audio signal calculated by the target deep learning network model is used for performing noise reduction processing on the second audio signal again, so that the noise reduction effect on the audio signal is further improved.

Drawings

Fig. 1 is a schematic diagram of an audio noise reduction method according to an embodiment of the present application;

Fig. 2 is a second schematic diagram of an audio noise reduction method according to an embodiment of the present application;

Fig. 3 is a third schematic diagram of an audio noise reduction method according to an embodiment of the present application;

Fig. 4 is a schematic structural diagram of an audio noise reduction apparatus according to an embodiment of the present disclosure;

Fig. 5 is a hardware schematic diagram of an electronic device according to an embodiment of the present disclosure;

Fig. 6 is a second hardware schematic diagram of an electronic device according to an embodiment of the present disclosure.

Detailed Description

The technical solutions in the embodiments of the present application will be described clearly below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments that can be derived by one of ordinary skill in the art from the embodiments given herein are intended to be within the scope of the present disclosure.

The terms first, second and the like in the description and in the claims of the present application are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application are capable of operation in sequences other than those illustrated or described herein. The objects distinguished by "first", "second", and the like are usually a class, and the number of the objects is not limited, and for example, the first object may be one or a plurality of objects. In addition, "and/or" in the specification and claims means at least one of connected objects, a character "/" generally means that a preceding and succeeding related objects are in an "or" relationship.

The audio denoising method provided by the embodiment of the present application is described in detail below with reference to the accompanying drawings through specific embodiments and application scenarios thereof.

The audio noise reduction method provided by the embodiment of the application can be applied to any one of the following scenarios: in a first scenario, a user uses the electronic device to make a voice call; in a second scenario, a user uses the electronic device to receive and send a pre-judging signal; and in a third scenario, a user uses the electronic device to play audio.

The electronic equipment can acquire a first audio signal comprising a voice signal and a noise signal, and perform pre-noise reduction processing on the first audio signal to obtain a second audio signal, wherein the signal-to-noise ratio of the second audio signal is greater than that of the first audio signal; then, inputting the second audio signal into a target deep learning network model to obtain an ideal mask of the second audio signal, wherein the target deep learning network model is obtained by training a third audio signal and the ideal mask of the third audio signal, and the third audio signal is a signal obtained from the second audio signal; and then, according to the ideal mask of the second audio signal, carrying out noise reduction processing on the second audio signal to obtain a target audio signal. According to the method, on one hand, the audio signal is subjected to pre-noise reduction to improve the signal-to-noise ratio of the audio signal, so that the characteristics of the voice signal in the audio signal are more prominent, and the accuracy of a trained target deep learning network model is improved; on the other hand, the ideal mask of the second audio signal calculated by the target deep learning network model is used for carrying out noise reduction processing on the second audio signal again, so that the noise reduction effect on the audio signal is further improved.

As shown in fig. 1, an embodiment of the present application provides an audio noise reduction method, which may include steps 101 to 104 described below.

Step 101, the electronic device acquires a first audio signal.

Wherein the first audio signal comprises a voice signal and a noise signal.

Optionally, in this embodiment of the application, the first audio signal is a noisy speech signal. Specifically, the noisy speech signal may be a noisy speech signal acquired by a microphone, or a simulated noisy speech signal obtained by the electronic device superimposing a pure speech signal and a noise signal. The simulated noisy speech signal is mainly used during the training process.

Optionally, in this embodiment of the application, before step 101, the electronic device may simulate or superimpose a voice signal (e.g., a historical call signal, a pure user voice signal, etc.) and a noise signal (e.g., various collected environmental noise signals, or a white noise signal randomly generated by the electronic device, etc.) stored by the electronic device to obtain a noisy voice signal, which is used as the first audio signal. Generating the first audio signal in this way makes it possible to control its signal-to-noise ratio explicitly and to know the speech signal and the noise signal exactly, which can be used for verifying the accuracy of the trained target deep learning network model in the subsequent steps.

Optionally, in this embodiment of the application, the first audio signal is generally a frequency domain signal of a noisy speech signal. If the first audio signal obtained in actual use is a time-domain signal, its frequency-domain signal may be obtained by Fourier transform; for details, reference may be made to related technologies, which are not described here.

For example, assume that the electronic device randomly acquires a pure speech signal s(n) and a noise signal v(n), where n represents time. The electronic device may superimpose them to obtain the noisy speech signal y(n) = s(n) + v(n), and may then perform a Fourier transform on y(n) to obtain Y(m, k) as the first audio signal, where m represents time, k represents frequency points, and m and k are both positive integers.
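The superposition and Fourier transform in this example can be sketched as follows. This is only an illustrative sketch: the helper names `mix_at_snr` and `stft`, the Hann window, the frame length, the hop size and the sine wave standing in for speech are all assumptions that do not come from the application.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that speech + noise has the requested SNR in dB."""
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    scaled_noise = scale * noise
    return speech + scaled_noise, scaled_noise

def stft(x, frame_len=256, hop=128):
    """Return Y[m, k]: frame index m, frequency bin k (Hann-windowed FFT)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[m * hop : m * hop + frame_len] * window
                       for m in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

rng = np.random.default_rng(0)
n = np.arange(16000)
s = np.sin(2 * np.pi * 440 * n / 16000)   # stand-in for the pure speech s(n)
v = rng.standard_normal(len(n))           # stand-in for the noise v(n)
y, v_scaled = mix_at_snr(s, v, snr_db=5.0)  # noisy speech with a chosen SNR
Y = stft(y)                                 # first audio signal Y(m, k)
```

Because the noise is scaled before mixing, the signal-to-noise ratio of the simulated first audio signal is known in advance, which is what makes such a signal convenient for training and verification.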

In the embodiments of the present application, the audio signals in the following embodiments are all frequency domain signals without being specifically described, that is, the audio signals are processed and operated on the frequency domain.

Step 102, the electronic device performs pre-noise reduction processing on the first audio signal to obtain a second audio signal.

Wherein the signal-to-noise ratio of the second audio signal is greater than that of the first audio signal.

Optionally, in this embodiment of the application, the pre-noise reduction processing performed by the electronic device on the first audio signal aims to improve the signal-to-noise ratio of the audio signal, so that the characteristics of the voice signal in the audio signal are more prominent. On one hand, this improves the training effect and accuracy when the target deep learning network model is trained; on the other hand, in actual inference, the characteristics of the voice signal are more distinct, so that the processing speed of using the target deep learning network model can be improved.

Optionally, the following embodiments pre-denoise the first audio signal by performing stationary noise floor estimation on it, so as to obtain a second audio signal with a higher signal-to-noise ratio.

Optionally, in this embodiment of the application, step 102 may be specifically implemented by the following steps 102a to 102c.

Step 102a, the electronic device performs stationary noise floor estimation on the first audio signal to obtain a first noise floor.

Optionally, in this embodiment of the application, the electronic device may determine the first noise floor by using minimum tracking, histogram, and the like for the first audio signal. For specific operations, reference may be made to related technologies, which are not described herein in detail.
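As an illustration of the minimum tracking idea, the sketch below smooths the power spectrum over time and takes, per frequency point, the minimum over a sliding window of recent frames. It is a simplified sketch, not the estimator of the application: the function name `min_tracking_noise_floor`, the smoothing factor `alpha` and the window length `win` are assumed values.

```python
import numpy as np

def min_tracking_noise_floor(power, alpha=0.9, win=20):
    """Estimate a stationary noise floor per frequency bin.

    power: array [frames, bins] holding |Y(m, k)|^2.
    alpha: recursive smoothing factor (assumed value).
    win:   number of recent frames searched for the minimum (assumed value).
    """
    smoothed = np.empty_like(power)
    smoothed[0] = power[0]
    for m in range(1, len(power)):
        # First-order recursive smoothing along the time axis.
        smoothed[m] = alpha * smoothed[m - 1] + (1.0 - alpha) * power[m]
    floor = np.empty_like(power)
    for m in range(len(power)):
        start = max(0, m - win + 1)
        # The minimum of the smoothed power over the window is taken as noise.
        floor[m] = smoothed[start : m + 1].min(axis=0)
    return floor

# A constant noise level of 1.0 with a short, loud burst in frames 30 to 34.
power = np.ones((200, 4))
power[30:35] = 100.0
floor = min_tracking_noise_floor(power)
```

During the burst the estimate stays at the pre-burst level, because the window still contains quiet frames. This is the property that lets the estimate follow stationary noise while ignoring short speech segments.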

It should be noted that the first noise floor is a stationary noise floor. The first noise floor characterizes the noise reduction standard determined after stationary noise floor estimation is performed on the first audio signal. In the subsequent steps, a corresponding gain value (e.g., a first gain) may be calculated from the first noise floor, and the first audio signal is subjected to pre-noise reduction processing through the gain value. That is, the noise reduction in step 102 is noise reduction based on the stationary noise floor.

Step 102b, the electronic device calculates a first gain according to the first audio signal and the first noise floor.

Optionally, in this embodiment of the application, the specific method for calculating the first gain may be: first, the electronic device calculates a posterior signal-to-noise ratio from the first noise floor and the first audio signal, and further determines a prior signal-to-noise ratio; the electronic device then determines the first gain by a Wiener filtering method using the prior signal-to-noise ratio and the posterior signal-to-noise ratio.

It should be noted that, in the embodiment of the present application, the first gain is used to represent the degree of noise reduction processing performed on the first audio signal, that is, when the first gain value is larger, the noise reduction effect is theoretically better. However, in the actual use process, it is also necessary to consider the energy spectrum of the second audio signal obtained after the first audio signal is processed, the integrity of the speech signal in the second audio signal, and the like.

Step 102c, the electronic device performs pre-noise reduction processing on the first audio signal according to the first gain to obtain a second audio signal.

Optionally, in this embodiment of the application, the electronic device may perform noise reduction processing such as filtering, correction and compensation on the first audio signal according to the first gain to obtain the second audio signal. The second audio signal is the first audio signal after the pre-noise reduction processing. Compared with the first audio signal, the characteristics of the voice signal in the second audio signal are more prominent, which facilitates training or result verification.

Specifically, performing noise reduction on the first audio signal according to the first gain to obtain the second audio signal may be determined by the following formula:

$$\hat{Y}(m, k) = G_s(m, k)\, Y(m, k)$$

wherein Y(m, k) represents the first audio signal, G_s(m, k) represents the first gain, $\hat{Y}(m, k)$ represents the second audio signal subjected to the pre-noise reduction processing, m represents time, k represents frequency points, and m and k are positive integers.
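Steps 102b and 102c can be sketched together as follows. This is a minimal sketch under stated assumptions: the prior signal-to-noise ratio is approximated by the posterior signal-to-noise ratio minus one (clipped at zero), the Wiener gain is then prior / (1 + prior), and a lower bound `gain_min` is added to limit speech distortion. The function name `pre_noise_reduction`, this particular prior estimate and the bound are illustrative choices, not details taken from the application.

```python
import numpy as np

def pre_noise_reduction(first_spec, noise_floor, gain_min=0.1, eps=1e-12):
    """Apply a Wiener-style first gain to the first audio signal Y(m, k).

    first_spec:  complex spectrum of the first audio signal.
    noise_floor: estimated noise power per time-frequency point.
    gain_min:    assumed lower bound on the gain.
    """
    noisy_power = np.abs(first_spec) ** 2
    post_snr = noisy_power / (noise_floor + eps)    # posterior SNR
    prior_snr = np.maximum(post_snr - 1.0, 0.0)     # assumed prior SNR estimate
    gain = np.maximum(prior_snr / (1.0 + prior_snr), gain_min)
    second_spec = gain * first_spec                 # second audio signal
    return second_spec, gain

first_spec = np.array([[1.0 + 0j, np.sqrt(2.0) + 0j, np.sqrt(11.0) + 0j]])
noise_floor = np.ones((1, 3))
second_spec, first_gain = pre_noise_reduction(first_spec, noise_floor)
```

In this example the three points have posterior signal-to-noise ratios of 1, 2 and 11, so points dominated by noise are attenuated strongly while points dominated by speech pass almost unchanged.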

It can be understood that the electronic device can determine the first noise floor by stationary noise floor estimation, determine the first gain from it, and perform pre-noise reduction on the first audio signal according to the first gain to obtain a second audio signal with a higher signal-to-noise ratio, so that the characteristics of the voice signal in the second audio signal are more prominent. On one hand, this improves the training effect and accuracy when the target deep learning network model is trained; on the other hand, it makes inference with the target deep learning network model in actual use faster, which improves the processing speed of the electronic device and saves the user's time.

Step 103, the electronic device inputs the second audio signal into the target deep learning network model to obtain an ideal mask of the second audio signal.

The target deep learning network model is obtained by training a third audio signal and an ideal mask of the third audio signal, wherein the third audio signal is a signal obtained from a second audio signal.

Optionally, in this embodiment of the application, the third audio signal may specifically be a part of the audio signal intercepted from the second audio signal, and the part of the audio signal may be multiple, and specifically may be determined according to the number of the required training samples.

It should be noted that, in this embodiment of the application, the target deep learning network model is a trained neural network, and the electronic device may input a second audio signal to the target deep learning network model to directly obtain an ideal mask of the second audio signal.

Optionally, in the embodiment of the present application, since the deep learning network model is used for fitting a nonlinear relationship, it may be selected according to actual use requirements, for example at least one of a convolutional neural network (CNN), a recurrent neural network (RNN), a multi-layer perceptron (MLP), a BP neural network (BP), and a long short-term memory network (LSTM). The selected model is trained, for which reference may be made to the detailed description of steps 105 to 106 below, and the trained target deep learning network model is then called directly as a module.

In addition, in this embodiment of the application, the third audio signal is specifically a partial signal cut from the second audio signal, and the ideal mask corresponding to the partial signal is an ideal mask of the third audio signal. Namely, the target deep learning network model is trained by the third audio signal and the corresponding ideal mask of the third audio signal.

Specifically, the ideal mask IRM(m, k) may be calculated according to the following calculation formula:

wherein S(m, k) represents the pure speech signal in the frequency domain, which can be obtained by performing a Fourier transform on the pure speech signal s(n) in the time domain.

In addition, during the calculation, the ideal mask may be calculated for each third audio signal in turn.
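For orientation only, the sketch below computes one widely used form of the ideal ratio mask, the square root of the speech power divided by the sum of the speech power and the noise power. This form, the noise spectrum argument `noise_spec` and the function name `ideal_ratio_mask` are assumptions, and the formula intended by the application may differ.

```python
import numpy as np

def ideal_ratio_mask(clean_spec, noise_spec, eps=1e-12):
    """One common IRM definition: sqrt(|S|^2 / (|S|^2 + |V|^2)).

    clean_spec: pure speech spectrum S(m, k).
    noise_spec: noise spectrum V(m, k) (an assumed input).
    """
    s_pow = np.abs(clean_spec) ** 2
    v_pow = np.abs(noise_spec) ** 2
    return np.sqrt(s_pow / (s_pow + v_pow + eps))

S = np.array([3.0 + 0j, 0.0 + 0j, 1.0 + 0j])
V = np.array([4.0 + 0j, 2.0 + 0j, 0.0 + 0j])
irm = ideal_ratio_mask(S, V)
```

Under this definition the mask lies between 0 and 1 at every time-frequency point: values near 1 mark points dominated by speech and values near 0 mark points dominated by noise.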

And step 104, the electronic equipment performs noise reduction processing on the second audio signal according to the ideal mask of the second audio signal to obtain a target audio signal.

Optionally, in this embodiment of the application, the electronic device may perform noise reduction processing on the second audio signal again by using the ideal mask of the second audio signal calculated by the target deep learning network model, so as to further improve the noise reduction effect on the audio signal.

Optionally, in this embodiment of the application, the method for obtaining the target audio signal may specifically be: first, the electronic device performs noise floor estimation on the second audio signal according to the ideal mask of the second audio signal to obtain a noise floor (for example, the second noise floor below) that reflects both the stationary noise floor and the non-stationary noise floor; then, the electronic device calculates a gain value (e.g., a second gain) based on the noise floor; finally, the electronic device performs noise reduction processing on the second audio signal according to the gain value to obtain the target audio signal. For a specific implementation, reference may be made to the detailed description of steps 104a to 104c below, which is not repeated here.
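The three-stage flow just described (noise floor from the mask, gain from the noise floor, gain applied to the signal) can be illustrated roughly as follows. The concrete formulas are assumptions made only for this sketch: the mask is treated as an amplitude-domain ratio mask, so that (1 - mask^2) is taken as the noise share of the power, and the gain is a Wiener-style gain. The function name `second_noise_reduction` is hypothetical, and the actual steps 104a to 104c may differ.

```python
import numpy as np

def second_noise_reduction(second_spec, mask, eps=1e-12):
    """Mask-driven second noise reduction pass (illustrative only)."""
    power = np.abs(second_spec) ** 2
    # (a) Noise floor implied by the mask: the power not attributed to speech.
    second_noise_floor = (1.0 - mask ** 2) * power
    # (b) Wiener-style second gain computed from that noise floor.
    post_snr = power / (second_noise_floor + eps)
    prior_snr = np.maximum(post_snr - 1.0, 0.0)
    second_gain = prior_snr / (1.0 + prior_snr)
    # (c) Apply the gain to obtain the target audio signal.
    return second_gain * second_spec, second_gain

second_spec = np.array([2.0 + 0j, 2.0 + 0j, 2.0 + 0j])
mask = np.array([0.0, 0.6, 1.0])
target_spec, second_gain = second_noise_reduction(second_spec, mask)
```

Because the mask is predicted frame by frame, a noise floor derived from it can follow noise that changes over time, which a purely stationary estimate cannot do.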

The embodiment of the application provides an audio noise reduction method, which can acquire a first audio signal, wherein the first audio signal comprises a voice signal and a noise signal; carrying out pre-noise reduction processing on the first audio signal to obtain a second audio signal, wherein the signal-to-noise ratio of the second audio signal is greater than that of the first audio signal; inputting the second audio signal into a target deep learning network model to obtain an ideal mask of the second audio signal, wherein the target deep learning network model is obtained by training a third audio signal and the ideal mask of the third audio signal, and the third audio signal is a signal obtained from the second audio signal; and performing noise reduction processing on the second audio signal according to the ideal mask of the second audio signal to obtain a target audio signal. According to the method, on one hand, the audio signal is subjected to pre-noise reduction to improve the signal-to-noise ratio of the audio signal, so that the characteristics of the voice signal in the audio signal are more prominent, and the accuracy of the trained target deep learning network model is improved; on the other hand, the ideal mask of the second audio signal calculated by the target deep learning network model is used for carrying out noise reduction processing on the second audio signal again, so that the noise reduction effect on the audio signal is further improved.

Optionally, with reference to fig. 1, as shown in fig. 2, before step 103, the audio denoising method provided in the embodiment of the present application further includes the following step 105 and step 106.

Step 105, the electronic device obtains a training sample.

The training samples include M third audio signals and ideal masks of the M third audio signals, each third audio signal corresponds to an ideal mask of a third audio signal, the M third audio signals are all signals obtained from the second audio signal, and M is a positive integer.
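A possible way to assemble such training samples is sketched below: M segments are cut at random positions from the magnitude spectrum of the second audio signal, and the matching rows of its ideal mask are cut at the same positions. The function name `make_training_samples`, the segment length `seg_frames` and the random choice of positions are assumptions for illustration.

```python
import numpy as np

def make_training_samples(magnitude, mask, seg_frames, m_samples, seed=0):
    """Cut M aligned (third audio signal, ideal mask) pairs.

    magnitude: [frames, bins] magnitude spectrum of the second audio signal.
    mask:      [frames, bins] ideal mask aligned with `magnitude`.
    """
    rng = np.random.default_rng(seed)
    last_start = magnitude.shape[0] - seg_frames
    starts = rng.integers(0, last_start + 1, size=m_samples)
    third_signals = np.stack([magnitude[s : s + seg_frames] for s in starts])
    third_masks = np.stack([mask[s : s + seg_frames] for s in starts])
    return third_signals, third_masks

rng = np.random.default_rng(1)
magnitude = rng.random((100, 129))
mask = 0.5 * magnitude        # placeholder mask, only to check the alignment
third_signals, third_masks = make_training_samples(magnitude, mask, 8, 16)
```

Using the same start index for both arrays keeps each third audio signal paired with its own ideal mask, which is the one-to-one correspondence the training relies on.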

Optionally, in this embodiment of the present application, the number of training samples, that is, the value of M, may be reasonably determined according to the number of the first audio signals. In actual use, the larger the number of training samples (i.e., the larger M is), the longer the training time of the deep learning network model and the higher its accuracy.

Preferably, M is 5000 or more, and preferably 10000 or less. Of course, in actual use M can also be determined by comprehensively considering factors such as the user's requirement on training accuracy, the number of the first audio signals, and the training time.

Optionally, in this embodiment of the application, in an actual operation process, the frequency-domain spectral feature of the third audio signal and the corresponding ideal mask of the third audio signal may each be used separately as a training label, or may be used together as training labels. The following embodiments take the case in which they are used together as an example, so that each training sample includes a third audio signal (which may be characterized by its frequency domain spectrum) and the ideal mask corresponding to the third audio signal.

Step 106, the electronic device trains the deep learning network model according to the training samples until the target evaluation condition is met, and the target deep learning network model is obtained.

Wherein the target evaluation condition includes an evaluation function constructed by an ideal mask of the third audio signal.

Optionally, in the embodiment of the present application, since the deep learning network model is used for fitting a nonlinear relationship, it may be selected according to actual use requirements, and at least one of a convolutional neural network (CNN), a recurrent neural network (RNN), a multi-layer perceptron (MLP), a BP neural network (BP), and a long short-term memory network (LSTM) may be trained.

Specifically, after a user determines a network model (e.g., a CNN network model), first, the user may set an appropriate number of network layers and nodes (which may be determined according to training requirements, input label types, and the like), and then may select an appropriate activation function (e.g., sigmoid, tanh, and the like); then, an evaluation function is constructed (i.e., a loss function and a cost function are constructed from the ideal mask of the third audio signal).

Optionally, in this embodiment of the present application, the evaluation function may be a loss function and/or a cost function. Since the cost function is the sum of all the loss functions, it is better suited to evaluating the accuracy of the deep learning network model. The loss function and the cost function can therefore be constructed from the ideal mask of the third audio signal; for example, when the cost function adopts the mean square error, the error is evaluated against the ideal mask of the third audio signal.

Optionally, in this embodiment of the application, after determining the target evaluation condition, the user may set the target threshold as a training termination condition. That is, when the target evaluation condition (e.g., cost function) of the deep learning network model after training is less than or equal to the target threshold, the training is terminated, and the deep learning network model is used as the target deep learning network model.
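A mean square error cost and the threshold-based termination check can be sketched as follows. The function names `mse_cost` and `should_stop` are hypothetical, and the cost here is the mean rather than the plain sum of the per-point squared errors, which changes only the scale of the threshold.

```python
import numpy as np

def mse_cost(predicted_mask, ideal_mask):
    """Mean square error between the predicted mask and the ideal mask."""
    return float(np.mean((predicted_mask - ideal_mask) ** 2))

def should_stop(cost, target_threshold):
    """Training terminates once the cost is at or below the target threshold."""
    return cost <= target_threshold

predicted = np.array([[0.5, 0.5]])
ideal = np.array([[1.0, 0.0]])
cost = mse_cost(predicted, ideal)
```

With these values the cost is 0.25, so training would stop for a target threshold of 0.3 and continue for a target threshold of 0.1.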

Optionally, in this embodiment of the application, the training samples may be divided into the following three sets in the training process: a training set, a test set and a validation set. The training set is used for preliminary training of the deep learning network model; the test set is used for adjusting the parameters of the preliminarily trained network model and performing secondary training on it; and the validation set is used for verifying the network model after secondary training, and the target deep learning network model is obtained after the verification is passed.

Preferably, in the embodiment of the present application, a suitable distribution ratio of the training samples is training set : test set : validation set = 6:2:2, or training set : test set : validation set = 6:3:1.
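Such a split can be sketched as follows. The function name `split_indices` and the random shuffling before the split are assumptions for illustration.

```python
import numpy as np

def split_indices(n_samples, ratios=(6, 2, 2), seed=0):
    """Split sample indices into training, test and validation sets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)   # shuffle before splitting
    total = sum(ratios)
    n_train = n_samples * ratios[0] // total
    n_test = n_samples * ratios[1] // total
    train = order[:n_train]
    test = order[n_train : n_train + n_test]
    validation = order[n_train + n_test :]   # remainder goes to validation
    return train, test, validation

train, test, validation = split_indices(5000, ratios=(6, 2, 2))
train_b, test_b, validation_b = split_indices(5000, ratios=(6, 3, 1))
```

For M = 5000 samples, the 6:2:2 ratio gives 3000, 1000 and 1000 samples, and the 6:3:1 ratio gives 3000, 1500 and 500 samples.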

Optionally, in the embodiment of the present application, to prevent overfitting, that is, the phenomenon in which the training error is reduced to a low level while the test error remains relatively high, a user may appropriately modify the neural network model, for example by adjusting the number of network layers and the number of nodes, or by adjusting the distribution ratio of the training set, the test set and the validation set.

It can be understood that the electronic device may acquire the third audio signal and an ideal mask of the third audio signal from the second audio signal as training samples, and construct a target evaluation condition based on the ideal mask of the third audio signal, so that the electronic device trains the deep learning network model according to the training samples until the target evaluation condition is met, thereby obtaining the target deep learning network model. Therefore, the accuracy of the obtained target deep learning network model is higher.

Optionally, in this embodiment of the application, the "training the deep learning network model according to the training sample" may be specifically implemented by the following steps 106a and 106b.

Step 106a, for one training sample in the training samples, the electronic device extracts a frequency domain magnitude spectrum of the training sample, and obtains an ideal mask corresponding to the training sample.

Step 106b, the electronic device trains the deep learning network model according to the frequency domain magnitude spectrum of the training sample and the ideal mask corresponding to the training sample.

It should be noted that, in the embodiment of the present application, for each training sample, the features input to the deep learning network model are the frequency domain magnitude spectrum of the training sample and the ideal mask of the training sample. The essence of the training is to fit the correspondence between the frequency domain magnitude spectrum and the ideal mask (the target deep learning network model obtained by training characterizes this correspondence), so that the electronic device can use the correspondence directly in step 103: after the second audio signal is input to the target deep learning network model, the electronic device directly obtains the ideal mask of the second audio signal.

Optionally, in this embodiment of the application, the frequency domain magnitude spectrum and the ideal mask correspond to each other one to one, that is, each training sample takes the frequency domain magnitude spectrum of the training sample and the ideal mask of the training sample as features, and is input to the deep learning network model for training.

For example, assume that the selected network model is a convolutional neural network and that each training sample includes the frequency domain magnitude spectrum of the sample and the corresponding ideal mask. On this basis, the electronic device inputs the training samples to the convolutional neural network for training and obtains the network parameters after convergence. The specific training process may include: randomly initializing the network parameters; performing feature preprocessing (such as zero-mean and unit-variance normalization) on the frequency domain magnitude spectrum and the corresponding ideal mask; back-propagating the output error (obtained through a loss function) to compute the gradients of the network parameters; and updating the network parameters according to a gradient descent algorithm. When the output error is less than or equal to the target threshold, the training is terminated, the deep learning network model is used as the target deep learning network model, and the parameters at that point are determined as the network parameters of the target deep learning network model.
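The training procedure described above can be sketched as follows. This is a deliberately simplified stand-in: a single sigmoid layer replaces the convolutional neural network, mean squared error is assumed as the loss function, and only the input features are normalized. The function name `train_mask_model`, the learning rate, and the threshold value are hypothetical.

```python
import numpy as np

def train_mask_model(features, masks, lr=0.5, threshold=0.01,
                     max_epochs=5000, seed=0):
    """Fit magnitude spectrum -> mask with gradient descent until loss <= threshold."""
    rng = np.random.default_rng(seed)
    # Feature preprocessing: zero mean and unit variance per frequency bin.
    mean = features.mean(axis=0)
    std = features.std(axis=0) + 1e-8
    x = (features - mean) / std
    # Random initialization of the network parameters.
    w = 0.01 * rng.standard_normal((x.shape[1], masks.shape[1]))
    b = np.zeros(masks.shape[1])
    history = []
    for _ in range(max_epochs):
        pred = 1.0 / (1.0 + np.exp(-(x @ w + b)))   # sigmoid keeps output in (0, 1)
        err = pred - masks
        loss = float(np.mean(err ** 2))             # output error (MSE, assumed)
        history.append(loss)
        if loss <= threshold:                       # target threshold reached
            break
        # Back-propagate the error through the sigmoid, then update parameters.
        grad_z = 2.0 * err * pred * (1.0 - pred) / err.size
        w -= lr * (x.T @ grad_z)
        b -= lr * grad_z.sum(axis=0)
    return {"w": w, "b": b, "mean": mean, "std": std}, history

# Synthetic demo data: masks generated by a hidden sigmoid mapping.
rng = np.random.default_rng(1)
demo_features = np.abs(rng.standard_normal((200, 8)))
true_w = rng.standard_normal((8, 8))
centered = demo_features - demo_features.mean(axis=0)
demo_masks = 1.0 / (1.0 + np.exp(-(centered @ true_w)))
params, history = train_mask_model(demo_features, demo_masks)
```

The loop stops either when the error reaches the threshold or when `max_epochs` is exhausted; a practical system would also monitor a held-out set to detect overfitting.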

It can be understood that, since the electronic device may train the deep learning network model using the frequency domain magnitude spectrum of each training sample and the corresponding ideal mask, in a case that an output error of a training result is less than or equal to the target threshold, the training may be terminated, and the deep learning network model may be used as the target deep learning network model, so that a target deep learning network model with higher accuracy may be obtained.

Optionally, with reference to fig. 1 and as shown in fig. 3, the step 104 may be specifically implemented by the following steps 104a to 104c.

Step 104a, the electronic device performs noise floor estimation on the second audio signal according to the ideal mask of the second audio signal to obtain a second noise floor.

It should be noted that the second noise floor includes stationary noise floor and non-stationary noise floor, and is used to comprehensively evaluate the noise reduction standard of the second audio signal after noise floor estimation.

In addition, in the embodiment of the present application, the process of determining the second noise floor may refer to the specific description in the step 102a, which is not repeated herein.

Specifically, the electronic device obtains prior information from the ideal mask of the second audio signal and may then use it to control the update of the noise estimate, thereby obtaining the second noise floor.
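One way such mask-controlled updating could work is sketched below. The recursive smoothing rule and the way the mask gates the smoothing factor are assumptions made for illustration; the text does not specify the update rule. The name `update_noise_floor` and the value of `alpha` are hypothetical.

```python
import numpy as np

def update_noise_floor(noise_psd, frame_psd, mask, alpha=0.9):
    """One-frame recursive noise update, gated per frequency bin by the mask."""
    # mask near 1 (speech dominant): smoothing factor near 1, estimate is held.
    # mask near 0 (noise dominant): smoothing factor alpha, estimate tracks the frame.
    alpha_eff = alpha + (1.0 - alpha) * mask
    return alpha_eff * noise_psd + (1.0 - alpha_eff) * frame_psd

noise_est = update_noise_floor(np.ones(4), np.full(4, 5.0),
                               np.array([0.0, 0.0, 1.0, 1.0]))
```

In this example the first two bins (mask 0) move toward the frame power, while the last two bins (mask 1) keep the previous estimate, so speech energy does not leak into the noise floor.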

Step 104b, the electronic device calculates a second gain according to the second audio signal and the second noise floor.

Optionally, in this embodiment of the application, the second gain may be calculated as follows: first, the a posteriori signal-to-noise ratio is calculated from the second noise floor and the second audio signal, and the a priori signal-to-noise ratio is then determined; the electronic device then determines the second gain by a Wiener filtering method using the a priori signal-to-noise ratio and the a posteriori signal-to-noise ratio. For details, reference may be made to the description of step 102b, which is not repeated herein.
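The gain calculation can be sketched as below. The a priori signal-to-noise ratio is estimated here with the decision-directed rule, which is an assumption; the text only states that the a priori value is derived from the a posteriori value. The function name, `beta`, and `gain_floor` are hypothetical.

```python
import numpy as np

def wiener_gain(frame_psd, noise_psd, prev_clean_psd, beta=0.98, gain_floor=0.05):
    """Return (gain, a posteriori SNR, a priori SNR) for one frame."""
    eps = 1e-12
    post_snr = frame_psd / (noise_psd + eps)
    # Decision-directed estimate (assumed): blend the previous frame's clean
    # power with the instantaneous estimate max(post_snr - 1, 0).
    prio_snr = (beta * prev_clean_psd / (noise_psd + eps)
                + (1.0 - beta) * np.maximum(post_snr - 1.0, 0.0))
    gain = prio_snr / (1.0 + prio_snr)      # Wiener gain
    return np.maximum(gain, gain_floor), post_snr, prio_snr

gain, post_snr, prio_snr = wiener_gain(np.array([10.0, 1.0]),
                                       np.array([1.0, 1.0]),
                                       np.array([9.0, 0.0]))
```

The gain floor limits the maximum attenuation, a common way to reduce musical-noise artifacts in bins that contain only noise.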

It should be noted that, in the embodiment of the present application, the second gain is used to characterize the degree of noise reduction processing performed on the second audio signal.

Step 104c, the electronic device performs noise reduction processing on the second audio signal according to the second gain to obtain a target audio signal.

Optionally, in this embodiment of the application, the electronic device may perform noise reduction processing such as filtering and correction compensation on the second audio signal according to the second gain to obtain the target audio signal. Compared with the first audio signal, the target audio signal has undergone noise reduction processing twice, so that the characteristics of the voice signal in the target audio signal are more prominent and a better noise reduction effect is achieved.

Specifically, the noise reduction of the second audio signal according to the second gain can be performed by the following formula:

$$\hat{S}(m,k) = G(m,k)\,Y(m,k)$$

where $Y(m,k)$ represents the second audio signal, $G(m,k)$ represents the second gain, $\hat{S}(m,k)$ represents the target audio signal after the second noise reduction processing, m represents time, k represents the frequency bin, and both m and k are positive integers.

Optionally, in this embodiment of the application, the electronic device may perform an inverse Fourier transform on the obtained target audio signal (which at this point is a frequency domain signal) to obtain the time domain signal of the target audio signal and output it; that is, the enhanced time domain signal is output.
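Applying the gain and returning to the time domain can be sketched as follows. The framing scheme (periodic Hann window with 50% overlap, overlap-add synthesis) is an assumption chosen because it reconstructs the signal exactly when the gain is 1; the function names and frame sizes are hypothetical.

```python
import numpy as np

FRAME_LEN, HOP = 256, 128
# Periodic Hann window: copies shifted by HOP sum to exactly 1.
WINDOW = 0.5 - 0.5 * np.cos(2.0 * np.pi * np.arange(FRAME_LEN) / FRAME_LEN)

def analyze(x):
    """Split x into windowed frames and return the complex spectrum per frame."""
    n_frames = 1 + (len(x) - FRAME_LEN) // HOP
    frames = np.stack([x[m * HOP:m * HOP + FRAME_LEN] * WINDOW
                       for m in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def denoise_and_resynthesize(spectrum, gain):
    """Multiply each bin by its gain, inverse-transform, and overlap-add."""
    enhanced = gain * spectrum
    frames = np.fft.irfft(enhanced, n=FRAME_LEN, axis=1)
    out = np.zeros(HOP * (len(frames) - 1) + FRAME_LEN)
    for m, frame in enumerate(frames):
        out[m * HOP:m * HOP + FRAME_LEN] += frame
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal(1024)
spec = analyze(x)
passthrough = denoise_and_resynthesize(spec, np.ones(spec.shape))
silenced = denoise_and_resynthesize(spec, np.zeros(spec.shape))
```

With a gain of 1 in every bin, the interior of the output matches the input, which is a quick sanity check that the transform pair itself introduces no distortion.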

It can be understood that the electronic device may determine the second noise floor according to the ideal mask of the second audio signal, then determine the second gain, and perform noise reduction on the second audio signal according to the second gain to obtain the target audio signal. In this way, the second audio signal is subjected to noise reduction processing again by using the ideal mask of the second audio signal calculated by the target deep learning network model, so that the noise reduction effect on the audio signal is further improved.

It should be noted that, in the audio noise reduction method provided in the embodiment of the present application, the execution main body may be an audio noise reduction device, or a control module in the audio noise reduction device for executing the audio noise reduction method. In the embodiment of the present application, an audio noise reduction apparatus is taken as an example to execute an audio noise reduction method, and the apparatus provided in the embodiment of the present application is described.

As shown in fig. 4, an embodiment of the present application provides an audio noise reduction apparatus 400. The audio noise reduction apparatus 400 may include: an acquisition module 401, a first noise reduction module 402, a processing module 403 and a second noise reduction module 404. The acquisition module 401 may be configured to acquire a first audio signal, where the first audio signal includes a speech signal and a noise signal. The first noise reduction module 402 may be configured to perform pre-noise reduction on the first audio signal acquired by the acquisition module 401 to obtain a second audio signal, where a signal-to-noise ratio of the second audio signal is greater than a signal-to-noise ratio of the first audio signal. The processing module 403 may be configured to input the second audio signal into a target deep learning network model to obtain an ideal mask of the second audio signal, where the target deep learning network model is obtained by training with a third audio signal and the ideal mask of the third audio signal, and the third audio signal is a signal obtained from the second audio signal. The second noise reduction module 404 may be configured to perform noise reduction processing on the second audio signal according to the ideal mask of the second audio signal to obtain a target audio signal.
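The division of labor among the four modules can be sketched as a small pipeline. The class name `AudioNoiseReducer`, its field names, and the toy stage functions in the example are all hypothetical; they only show how data flows from module 401 to module 404.

```python
from dataclasses import dataclass
from typing import Callable

import numpy as np

@dataclass
class AudioNoiseReducer:
    acquire: Callable[[], np.ndarray]                          # acquisition module 401
    pre_denoise: Callable[[np.ndarray], np.ndarray]            # first noise reduction module 402
    predict_mask: Callable[[np.ndarray], np.ndarray]           # processing module 403
    mask_denoise: Callable[[np.ndarray, np.ndarray], np.ndarray]  # second noise reduction module 404

    def run(self) -> np.ndarray:
        first = self.acquire()               # first audio signal
        second = self.pre_denoise(first)     # second audio signal (higher SNR)
        mask = self.predict_mask(second)     # ideal mask of the second audio signal
        return self.mask_denoise(second, mask)  # target audio signal

# Toy stages, used only to show the data flow.
reducer = AudioNoiseReducer(
    acquire=lambda: np.array([1.0, 2.0, 3.0, 4.0]),
    pre_denoise=lambda x: 0.5 * x,
    predict_mask=lambda x: (x > 1.0).astype(float),
    mask_denoise=lambda x, m: x * m,
)
result = reducer.run()
```

Keeping each stage behind a callable makes it possible to swap the pre-noise-reduction method or the network model without touching the rest of the pipeline.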

Optionally, in this embodiment of the application, the first noise reduction module 402 may be specifically configured to perform stationary noise floor estimation on the first audio signal to obtain a first noise floor; calculate a first gain according to the first audio signal and the first noise floor; and perform pre-noise reduction processing on the first audio signal according to the first gain to obtain the second audio signal.

Optionally, in this embodiment of the application, the acquisition module 401 may further be configured to acquire training samples before the second audio signal is input to the target deep learning network model. The training samples include M third audio signals and M ideal masks of the third audio signals, each third audio signal corresponds to one ideal mask of a third audio signal, the M third audio signals are all signals in the second audio signal, and M is a positive integer. The processing module 403 is further configured to train the deep learning network model according to the training samples until a target evaluation condition is met, so as to obtain the target deep learning network model. The target evaluation condition includes an evaluation function constructed from the ideal mask of the third audio signal.

Optionally, in this embodiment of the application, the processing module 403 may be specifically configured to, for one training sample in the training samples, extract a frequency domain magnitude spectrum of the training sample, and obtain an ideal mask corresponding to the training sample; and training the deep learning network model according to the frequency domain amplitude spectrum of the training sample and the ideal mask corresponding to the training sample.

Optionally, in this embodiment of the application, the second noise reduction module 404 may be specifically configured to perform noise floor estimation on the second audio signal according to the ideal mask of the second audio signal to obtain a second noise floor; calculate a second gain according to the second audio signal and the second noise floor; and perform noise reduction processing on the second audio signal according to the second gain to obtain the target audio signal.

The audio noise reduction apparatus in the embodiment of the present application may be a functional entity and/or a functional module in an electronic device, which executes an audio noise reduction method, or may be a component, an integrated circuit, or a chip in a terminal. The device can be mobile electronic equipment or non-mobile electronic equipment. By way of example, the mobile electronic device may be a mobile phone, a tablet computer, a notebook computer, a palm top computer, a vehicle-mounted electronic device, a wearable device, an ultra-mobile personal computer (UMPC), a netbook or a Personal Digital Assistant (PDA), and the like, and the non-mobile electronic device may be a server, a Network Attached Storage (NAS), a Personal Computer (PC), a Television (TV), a teller machine or a self-service machine, and the like, and the embodiments of the present application are not particularly limited.

The audio noise reduction device in the embodiment of the present application may be a device having an operating system. The operating system may be an Android operating system (Android), an iOS operating system, or other possible operating systems, which is not specifically limited in the embodiments of the present application.

The audio noise reduction device provided in the embodiment of the present application can implement each process implemented by the audio noise reduction device in the method embodiments of fig. 1 to fig. 4, and is not described herein again to avoid repetition.

The embodiment of the application provides an audio noise reduction device, which can acquire a first audio signal, wherein the first audio signal comprises a voice signal and a noise signal; perform pre-noise reduction processing on the first audio signal to obtain a second audio signal, wherein the signal-to-noise ratio of the second audio signal is greater than that of the first audio signal; input the second audio signal into a target deep learning network model to obtain an ideal mask of the second audio signal, wherein the target deep learning network model is obtained by training with a third audio signal and the ideal mask of the third audio signal, and the third audio signal is a signal obtained from the second audio signal; and perform noise reduction processing on the second audio signal according to the ideal mask of the second audio signal to obtain a target audio signal. In this scheme, on one hand, the audio signal is subjected to pre-noise reduction to improve its signal-to-noise ratio, so that the characteristics of the voice signal in the audio signal are more prominent and the accuracy of the trained target deep learning network model is improved; on the other hand, the ideal mask of the second audio signal calculated by the target deep learning network model is used to perform noise reduction processing on the second audio signal again, so that the noise reduction effect on the audio signal is further improved.

Optionally, as shown in fig. 5, an electronic device 500 is further provided in this embodiment of the present application, and includes a processor 501, a memory 502, and a program or an instruction stored in the memory 502 and executable on the processor 501, where the program or the instruction is executed by the processor 501 to implement each process of the above-mentioned audio noise reduction method embodiment, and can achieve the same technical effect, and in order to avoid repetition, details are not repeated here.

It should be noted that the electronic device in the embodiment of the present application includes the mobile electronic device and the non-mobile electronic device described above.

Fig. 6 is a schematic diagram of a hardware structure of an electronic device implementing an embodiment of the present application.

The electronic device 2000 includes, but is not limited to: a radio frequency unit 2001, a network module 2002, an audio output unit 2003, an input unit 2004, a sensor 2005, a display unit 2006, a user input unit 2007, an interface unit 2008, a memory 2009, and a processor 2010.

The input unit 2004 may include a graphics processor 20041 and a microphone 20042, the display unit 2006 may include a display panel 20061, the user input unit 2007 may include a touch panel 20071 and other input devices 20072, and the memory 2009 may be used to store software programs (e.g., an operating system, an application program required for at least one function) and various data.

Those skilled in the art will appreciate that the electronic device 2000 may further include a power supply (e.g., a battery) for supplying power to the various components, and the power supply may be logically connected to the processor 2010 through a power management system, so that charging, discharging, and power consumption are managed through the power management system. The electronic device structure shown in fig. 6 does not constitute a limitation of the electronic device; the electronic device may include more or fewer components than those shown, combine some components, or arrange the components differently, and details are not described here.

Among other things, the microphone 20042 may be used to acquire a first audio signal, which includes a speech signal and a noise signal. The processor 2010 may be configured to perform pre-noise reduction on the first audio signal acquired by the microphone 20042 to obtain a second audio signal, where a signal-to-noise ratio of the second audio signal is greater than a signal-to-noise ratio of the first audio signal. The processor 2010 may be further configured to input the second audio signal to a target deep learning network model, so as to obtain an ideal mask of the second audio signal, where the target deep learning network model is trained by a third audio signal and the ideal mask of the third audio signal, and the third audio signal is a signal obtained from the second audio signal. The processor 2010 may be further configured to perform noise reduction processing on the second audio signal according to the ideal mask of the second audio signal to obtain a target audio signal.

The embodiment of the application provides an electronic device, which can acquire a first audio signal, wherein the first audio signal comprises a voice signal and a noise signal; perform pre-noise reduction processing on the first audio signal to obtain a second audio signal, wherein the signal-to-noise ratio of the second audio signal is greater than that of the first audio signal; input the second audio signal into a target deep learning network model to obtain an ideal mask of the second audio signal, wherein the target deep learning network model is obtained by training with a third audio signal and the ideal mask of the third audio signal, and the third audio signal is a signal obtained from the second audio signal; and perform noise reduction processing on the second audio signal according to the ideal mask of the second audio signal to obtain a target audio signal. In this scheme, on one hand, the audio signal is subjected to pre-noise reduction to improve its signal-to-noise ratio, so that the characteristics of the voice signal in the audio signal are more prominent and the accuracy of the trained target deep learning network model is improved; on the other hand, the ideal mask of the second audio signal calculated by the target deep learning network model is used to perform noise reduction processing on the second audio signal again, so that the noise reduction effect on the audio signal is further improved.

Optionally, in this embodiment of the application, the processor 2010 may be specifically configured to perform stationary noise floor estimation on the first audio signal to obtain a first noise floor; calculate a first gain according to the first audio signal and the first noise floor; and perform pre-noise reduction processing on the first audio signal according to the first gain to obtain the second audio signal.

It can be understood that the electronic device can determine the first noise floor by stationary noise floor estimation, then determine the first gain, and perform pre-noise reduction on the first audio signal according to the first gain to obtain a second audio signal with a higher signal-to-noise ratio, so that the characteristics of the voice signal in the second audio signal are more prominent. On one hand, this improves the training effect and accuracy when the target deep learning network model is trained; on the other hand, when the target deep learning network model is actually used, its decisions are made more quickly, which improves the processing speed of the electronic device.

Optionally, in this embodiment of the application, the microphone 20042 may be further configured to obtain a training sample before inputting the second audio signal into the target deep learning network model. The training samples include M third audio signals and M ideal masks for the third audio signals, each third audio signal corresponds to an ideal mask for a third audio signal, the M third audio signals are all signals in the second audio signal, and M is a positive integer. The processor 2010 is further configured to train the deep learning network model according to the training sample until a target evaluation condition is met, so as to obtain a target deep learning network model. Wherein the target evaluation condition comprises an evaluation function constructed by an ideal mask of the third audio signal.

Optionally, in this embodiment of the application, the processor 2010 may be specifically configured to, for one training sample of the training samples, extract a frequency-domain magnitude spectrum of the one training sample, and obtain an ideal mask corresponding to the one training sample; and training the deep learning network model according to the frequency domain amplitude spectrum of the training sample and the ideal mask corresponding to the training sample.

Optionally, in this embodiment of the application, the processor 2010 may be specifically configured to perform noise floor estimation on the second audio signal according to the ideal mask of the second audio signal to obtain a second noise floor; calculate a second gain according to the second audio signal and the second noise floor; and perform noise reduction processing on the second audio signal according to the second gain to obtain the target audio signal.

The beneficial effects of the various implementation manners in this embodiment may specifically refer to the beneficial effects of the corresponding implementation manners in the above method embodiments, and are not described herein again to avoid repetition.

The embodiment of the present application further provides a readable storage medium, where a program or an instruction is stored on the readable storage medium, and when the program or the instruction is executed by a processor, the program or the instruction implements each process of the above audio noise reduction method embodiment, and can achieve the same technical effect, and in order to avoid repetition, details are not repeated here.

The processor is the processor in the electronic device in the above embodiment. The readable storage medium includes a computer readable storage medium, such as a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and so on.

The embodiment of the present application further provides a chip, where the chip includes a processor and a communication interface, the communication interface is coupled to the processor, and the processor is configured to execute a program or an instruction to implement each process of the above embodiment of the audio noise reduction method, and can achieve the same technical effect, and in order to avoid repetition, the description is omitted here.

It should be understood that the chip mentioned in the embodiments of the present application may also be referred to as a system-level chip, a system chip, a chip system, or a system-on-chip (SoC), etc.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element. Further, it should be noted that the scope of the methods and apparatus of the embodiments of the present application is not limited to performing the functions in the order illustrated or discussed, but may include performing the functions in a substantially simultaneous manner or in a reverse order based on the functions involved; for example, the methods described may be performed in an order different from that described, and various steps may be added, omitted, or combined. In addition, features described with reference to certain examples may be combined in other examples.

Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a computer software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal (such as a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present application.

While the present embodiments have been described with reference to the accompanying drawings, it is to be understood that the invention is not limited to the precise embodiments described above, which are meant to be illustrative and not restrictive, and that various changes may be made therein by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. A method for audio noise reduction, the method comprising:

acquiring a first audio signal, wherein the first audio signal comprises a voice signal and a noise signal;

carrying out pre-noise reduction processing on the first audio signal to obtain a second audio signal, wherein the signal-to-noise ratio of the second audio signal is greater than that of the first audio signal;

inputting the second audio signal into a target deep learning network model to obtain an ideal mask of the second audio signal, wherein the target deep learning network model is obtained by training a third audio signal and the ideal mask of the third audio signal, and the third audio signal is a signal obtained from the second audio signal;

and according to the ideal mask of the second audio signal, carrying out noise reduction processing on the second audio signal to obtain a target audio signal.

2. The method of claim 1, wherein pre-denoising the first audio signal to obtain a second audio signal comprises:

performing stationary background noise estimation on the first audio signal to obtain a first background noise;

calculating a first gain according to the first audio signal and the first background noise;

and according to the first gain, carrying out pre-noise reduction processing on the first audio signal to obtain the second audio signal.

3. The method of claim 1, wherein prior to inputting the second audio signal into a target deep learning network model, the method further comprises:

acquiring training samples, wherein the training samples comprise M third audio signals and M ideal masks of the third audio signals, each third audio signal corresponds to one ideal mask of the third audio signal, and M is a positive integer;

training a deep learning network model according to the training samples until a target evaluation condition is met to obtain the target deep learning network model;

wherein the target evaluation condition comprises an evaluation function constructed from an ideal mask of the third audio signal.

4. The method of claim 3, wherein training the deep learning network model according to the training samples comprises:

for one training sample in the training samples, extracting a frequency domain magnitude spectrum of the training sample, and acquiring an ideal mask corresponding to the training sample;

and training the deep learning network model according to the frequency domain amplitude spectrum of the training sample and the ideal mask corresponding to the training sample.

5. The method of claim 1, wherein performing noise reduction processing on the second audio signal according to the ideal mask of the second audio signal to obtain a target audio signal comprises:

performing background noise estimation on the second audio signal according to the ideal mask of the second audio signal to obtain a second background noise;

calculating a second gain according to the second audio signal and the second background noise;

and according to the second gain, carrying out noise reduction processing on the second audio signal to obtain the target audio signal.

6. An audio noise reduction apparatus, characterized in that the apparatus comprises: an acquisition module, a first noise reduction module, a processing module and a second noise reduction module;

the acquisition module is used for acquiring a first audio signal, wherein the first audio signal comprises a voice signal and a noise signal;

the first noise reduction module is configured to perform pre-noise reduction on the first audio signal acquired by the acquisition module to obtain a second audio signal, where a signal-to-noise ratio of the second audio signal is greater than a signal-to-noise ratio of the first audio signal;

the processing module is configured to input the second audio signal into a target deep learning network model to obtain an ideal mask of the second audio signal, where the target deep learning network model is obtained by training a third audio signal and the ideal mask of the third audio signal, and the third audio signal is a signal obtained from the second audio signal;

and the second noise reduction module is used for performing noise reduction processing on the second audio signal according to the ideal mask of the second audio signal to obtain a target audio signal.

7. The apparatus according to claim 6, wherein the first noise reduction module is specifically configured to perform stationary noise floor estimation on the first audio signal to obtain a first noise floor; calculate a first gain according to the first audio signal and the first noise floor; and according to the first gain, perform pre-noise reduction processing on the first audio signal to obtain the second audio signal.

8. The apparatus according to claim 6, wherein the acquisition module is further configured to acquire training samples before the second audio signal is input into the target deep learning network model, where the training samples include M third audio signals and M ideal masks of the third audio signals, each third audio signal corresponds to one ideal mask of a third audio signal, and M is a positive integer;

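the processing module is further used for training a deep learning network model according to the training samples until a target evaluation condition is met, so that the target deep learning network model is obtained;

wherein the target evaluation condition comprises an evaluation function constructed from an ideal mask of the third audio signal.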

9. The apparatus according to claim 8, wherein the processing module is specifically configured to, for one of the training samples, extract a frequency-domain magnitude spectrum of the one training sample, and obtain an ideal mask corresponding to the one training sample; and training the deep learning network model according to the frequency domain amplitude spectrum of the training sample and the ideal mask corresponding to the training sample.

10. The apparatus according to claim 6, wherein the second noise reduction module is specifically configured to perform noise floor estimation on the second audio signal according to the ideal mask of the second audio signal to obtain a second noise floor; calculate a second gain according to the second audio signal and the second noise floor; and according to the second gain, perform noise reduction processing on the second audio signal to obtain the target audio signal.

11. An electronic device comprising a processor, a memory, and a program or instructions stored on the memory and executable on the processor, the program or instructions when executed by the processor implementing the steps of the audio noise reduction method according to any one of claims 1 to 5.

12. A readable storage medium, characterized in that the readable storage medium has stored thereon a program or instructions which, when executed by a processor, implement the steps of the audio noise reduction method according to any of claims 1 to 5.