CN115602183A - Audio enhancement method and device, electronic equipment and storage medium

Audio enhancement method and device, electronic equipment and storage medium

Info

Publication number
CN115602183A
Authority
CN
China
Prior art keywords
audio
training
enhanced
sample
noise
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211167895.9A
Other languages
Chinese (zh)
Inventor
周阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Boguan Information Technology Co Ltd
Original Assignee
Guangzhou Boguan Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Boguan Information Technology Co Ltd filed Critical Guangzhou Boguan Information Technology Co Ltd
Priority to CN202211167895.9A
Publication of CN115602183A

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 - Noise filtering
    • G10L21/0272 - Voice signal separating
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques using neural networks

Abstract

The embodiment of the invention discloses an audio enhancement method, an audio enhancement device, an electronic device and a storage medium. The method comprises: obtaining an audio set comprising sample noiseless audio and sample noise audio, and generating sample mixed audio based on the sample noiseless audio and the sample noise audio; performing noise separation on the sample mixed audio through an audio enhancement model to be trained to obtain training noiseless audio and training noise audio; calculating a model loss corresponding to the audio enhancement model to be trained according to the sample noiseless audio, the sample noise audio, the sample mixed audio, the training noiseless audio and the training noise audio; and adjusting the audio enhancement model to be trained based on the model loss to obtain a trained audio enhancement model, so as to perform audio enhancement on audio to be enhanced through the trained audio enhancement model. The scheme can improve the training accuracy of the audio enhancement model and its audio enhancement effect, so that higher audio clarity and perceptual quality can be obtained after audio enhancement processing is performed on audio.

Description

Audio enhancement method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of audio technologies, and in particular, to an audio enhancement method and apparatus, an electronic device, and a storage medium.
Background
With the rapid development of science and technology, people use audio playback functions in a wide variety of applications. For example, a user may play or send audio in an instant messaging application, a game application, a social application, or another application; or, after receiving audio, a user may perform speech recognition on it to conveniently obtain its information.
However, during audio recording, noise that is widespread in everyday environments may degrade the clarity of the recorded audio. If the audio is not enhanced, the performance of audio-related functions such as speech recognition may suffer, as may the user's experience when playing the audio.
Disclosure of Invention
Embodiments of the present invention provide an audio enhancement method and apparatus, an electronic device, and a storage medium, which can improve the training accuracy of an audio enhancement model and improve its audio enhancement effect, so that higher audio clarity and perceptual quality can be obtained after audio enhancement processing is performed on audio.
The embodiment of the invention provides an audio enhancement method, which comprises the following steps:
acquiring an audio set comprising sample noiseless audio and sample noise audio, and generating sample mixed audio based on the sample noiseless audio and the sample noise audio;
carrying out noise separation on the sample mixed audio through an audio enhancement model to be trained to obtain training noiseless audio and training noise audio;
calculating model loss corresponding to the audio enhancement model to be trained according to the sample noiseless audio, the sample noise audio, the sample mixed audio, the training noiseless audio and the training noise audio;
and adjusting the audio enhancement model to be trained based on the model loss to obtain the trained audio enhancement model, so as to perform audio enhancement on the audio to be enhanced through the trained audio enhancement model.
Correspondingly, an embodiment of the present invention further provides an audio enhancement apparatus, including:
a sample obtaining unit, configured to obtain an audio set including a sample noiseless audio and a sample noise audio, and generate a sample mixed audio based on the sample noiseless audio and the sample noise audio;
the audio enhancement unit is used for carrying out noise separation on the sample mixed audio through an audio enhancement model to be trained to obtain a training noiseless audio and a training noise audio;
the loss calculation unit is used for calculating the model loss corresponding to the audio enhancement model to be trained according to the sample noiseless audio, the sample noise audio, the sample mixed audio, the training noiseless audio and the training noise audio;
and the model adjusting unit is used for adjusting the audio enhancement model to be trained based on the model loss to obtain the trained audio enhancement model so as to perform audio enhancement on the audio to be enhanced through the trained audio enhancement model.
Optionally, the loss calculating unit is configured to calculate the noiseless audio training loss of the audio enhancement model to be trained according to the sample noiseless audio and the training noiseless audio;
calculating the noise audio training loss of the audio enhancement model to be trained according to the sample noise audio and the training noise audio;
mixing the training noiseless audio and the training noise audio to obtain a training mixed audio, and calculating the mixed audio training loss of the audio enhancement model to be trained according to the training mixed audio and the sample mixed audio;
and calculating the model loss corresponding to the audio enhancement model to be trained based on the noiseless audio training loss, the noise audio training loss and the mixed audio training loss.
Optionally, the loss calculating unit is configured to obtain a first loss weight corresponding to the noiseless audio training loss, a second loss weight corresponding to the noise audio training loss, and a third loss weight corresponding to the mixed audio training loss;
and perform weighted calculation on the noiseless audio training loss, the noise audio training loss and the mixed audio training loss according to the first loss weight, the second loss weight and the third loss weight to obtain the model loss corresponding to the audio enhancement model to be trained.
Optionally, the audio enhancement unit is configured to map, through a coding layer of an audio enhancement model to be trained, a frequency domain representation of the sample mixed audio to initial features of the sample mixed audio;
respectively performing feature extraction on at least two sub-features of the initial features through a feature extraction layer of the audio enhancement model to be trained to obtain a spectral feature of each sub-feature, and performing inter-feature relationship extraction on the at least two sub-features to obtain spectral relationship features between the sub-features;
decoding and mapping each spectral feature and each spectral relationship feature through a decoding layer of the audio enhancement model to be trained to obtain a frequency domain representation of the training noiseless audio and a frequency domain representation of the training noise audio;
and obtaining the training noiseless audio and the training noise audio according to the frequency domain representation of the training noiseless audio and the frequency domain representation of the training noise audio.
Optionally, in the audio enhancement apparatus provided in the embodiment of the present invention, the audio enhancement unit further includes an audio obtaining subunit and an audio enhancement subunit, where the audio obtaining subunit is configured to obtain the audio to be enhanced;
and the audio enhancement subunit is configured to perform audio enhancement on the audio to be enhanced by using the trained audio enhancement model to obtain the enhanced audio corresponding to the audio to be enhanced.
Optionally, the audio obtaining subunit is configured to obtain a time domain representation of the audio to be enhanced, and convert the time domain representation into a preset number of frequency domain representations, where the preset number is at least 2;
determine real part features and imaginary part features corresponding to the frequency domain representations from the frequency domain representations of the audio to be enhanced, and fuse the real part features and the imaginary part features to obtain the features to be processed corresponding to the audio to be enhanced;
the audio enhancement subunit is configured to perform noise separation processing on the features to be processed by using the trained audio enhancement model to obtain enhanced real part features and enhanced imaginary part features of the enhanced audio corresponding to the audio to be enhanced, where the numbers of enhanced real part features and enhanced imaginary part features are both equal to the preset number;
and perform time domain conversion based on the enhanced real part features and the enhanced imaginary part features to obtain the enhanced audio corresponding to the audio to be enhanced.
Optionally, the audio enhancement subunit is configured to perform time domain conversion based on the enhanced real part features and the enhanced imaginary part features to obtain a preset number of enhanced sub-audios;
and synthesizing each enhanced sub-audio through a preset synthesis filter to obtain an enhanced audio corresponding to the audio to be enhanced.
Correspondingly, an embodiment of the present invention further provides an electronic device, including a memory and a processor; the memory stores an application program, and the processor is configured to run the application program in the memory to perform the steps in any audio enhancement method provided by the embodiments of the present invention.
Accordingly, the embodiment of the present invention further provides a computer-readable storage medium, where a plurality of instructions are stored, and the instructions are suitable for being loaded by a processor to perform the steps in any one of the audio enhancement methods provided by the embodiment of the present invention.
Furthermore, the embodiment of the present invention further provides a computer program product, which includes a computer program or instructions, and when the computer program or instructions are executed by a processor, the computer program or instructions implement the steps in any audio enhancement method provided by the embodiment of the present invention.
By adopting the scheme of the embodiment of the invention, an audio set comprising sample noiseless audio and sample noise audio can be obtained, sample mixed audio is generated based on the sample noiseless audio and the sample noise audio, noise separation is performed on the sample mixed audio through an audio enhancement model to be trained to obtain training noiseless audio and training noise audio, the model loss corresponding to the audio enhancement model to be trained is calculated according to the sample noiseless audio, the sample noise audio, the sample mixed audio, the training noiseless audio and the training noise audio, and the audio enhancement model to be trained is adjusted based on the model loss to obtain the trained audio enhancement model, which then performs audio enhancement on the audio to be enhanced. In the embodiment of the invention, when the model loss corresponding to the audio enhancement model to be trained is calculated, the sample noiseless audio, the sample noise audio, the sample mixed audio, the training noiseless audio and the training noise audio are all taken into account. Compared with calculating the loss only from the sample noiseless audio and the training noiseless audio, richer information participates in the loss calculation, and the resulting model loss supports more accurate adjustment of the model, so the training accuracy of the audio enhancement model can be improved, its audio enhancement effect can be improved, and higher audio clarity and perceptual quality can be obtained after audio enhancement processing is performed on audio.
Drawings
To illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Apparently, the drawings in the following description show only some embodiments of the present invention, and a person skilled in the art can derive other drawings from these drawings without creative effort.
Fig. 1 is a schematic view of a scene of an audio enhancement method provided by an embodiment of the present invention;
FIG. 2 is a flow chart of an audio enhancement method provided by an embodiment of the invention;
FIG. 3 is a schematic diagram of a technical implementation of an audio enhancement model provided by an embodiment of the invention;
FIG. 4 is a schematic diagram of a technical implementation in which an audio enhancement model includes a sequence conversion layer according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of prediction using a common feature extraction layer for subband signals according to an embodiment of the present invention;
FIG. 6 is a schematic flow chart of an audio enhancement model provided by an embodiment of the invention;
FIG. 7 is a schematic diagram of an analysis and synthesis filter provided by an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of an audio enhancement apparatus according to an embodiment of the present invention;
fig. 9 is a schematic structural diagram of an audio obtaining unit in an audio enhancement device according to an embodiment of the present invention;
fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without inventive step based on the embodiments of the present invention, are within the scope of protection of the present invention.
The embodiment of the invention provides an audio enhancement method, an audio enhancement device, electronic equipment and a computer-readable storage medium. In particular, embodiments of the present invention provide an audio enhancement method suitable for an audio enhancement apparatus, which may be integrated in an electronic device.
The electronic device may be a terminal or other devices, including but not limited to a mobile terminal and a fixed terminal, for example, the mobile terminal includes but is not limited to a smart phone, a smart watch, a tablet computer, a notebook computer, a smart car, and the like, wherein the fixed terminal includes but is not limited to a desktop computer, a smart television, and the like.
The electronic device may also be a server or other devices, where the server may be an independent physical server, a server cluster or a distributed system formed by multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud service, a cloud database, cloud computing, a cloud function, cloud storage, network service, cloud communication, middleware service, domain name service, security service, CDN (Content Delivery Network), and a big data and artificial intelligence platform, but is not limited thereto.
The audio enhancement method of the embodiment of the invention can be realized by a server, and can also be realized by a terminal and the server together.
The method is described below by taking an example in which the terminal and the server jointly implement the audio enhancement method.
As shown in fig. 1, the audio enhancement system provided by the embodiment of the present invention includes a terminal 10, a server 20, and the like; the terminal 10 and the server 20 are connected through a network, for example, a wired or wireless network connection, wherein the server 20 may act as an electronic device that receives the audio to be enhanced sent by the terminal 10, or sends the enhanced audio back to the terminal 10.
Among other things, the terminal 10 may be used for a user to record audio, transmit audio, and/or receive audio transmitted by the server 20.
The server 20 may obtain an audio set including a sample noiseless audio and a sample noise audio, generate a sample mixed audio based on the sample noiseless audio and the sample noise audio, perform noise separation on the sample mixed audio through the audio enhancement model to be trained, obtain a training noiseless audio and a training noise audio, calculate a model loss corresponding to the audio enhancement model to be trained according to the sample noiseless audio, the sample noise audio, the sample mixed audio, the training noiseless audio and the training noise audio, adjust the audio enhancement model to be trained based on the model loss, obtain the trained audio enhancement model, and perform audio enhancement on the audio to be enhanced through the trained audio enhancement model.
It is understood that in some embodiments, the step of audio enhancement performed by the server 20 may also be performed by the terminal 10, which is not limited by the embodiment of the present invention.
The following are detailed below. It should be noted that the following description of the embodiments is not intended to limit the preferred order of the embodiments.
Embodiments of the present invention will be described from the perspective of an audio enhancement device, which may be specifically integrated in a server or a terminal.
As shown in fig. 2, the specific flow of the audio enhancement method of the present embodiment may be as follows:
201. An audio set including sample noiseless audio and sample noise audio is obtained, and sample mixed audio is generated based on the sample noiseless audio and the sample noise audio.
Here, the sample noiseless audio is noise-free audio used as a sample in the training process of the audio enhancement model to be trained. For example, the sample noiseless audio may be audio recorded in a quiet environment such as a recording studio.
It is understood that the sample noiseless audio is not required to be one hundred percent free of noise, as long as the noise in it satisfies a preset noiseless judgment condition. For example, as long as the noise in an audio does not affect the speech recognition of the content of actual interest in the audio (for example, the text result of the speech recognition matches the text of the content of actual interest in the audio), the noise in the audio can be considered to satisfy the noiseless judgment condition, and the audio can be used as sample noiseless audio. The noiseless judgment condition can be set by a technician according to the actual application, and the embodiment of the invention does not limit it.
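For illustration only, the noiseless judgment condition described above might be checked as in the following sketch, where transcribe is an assumed external speech-recognition function and the 0.95 similarity threshold is illustrative rather than taken from this disclosure:

    from difflib import SequenceMatcher

    def satisfies_noiseless_condition(audio_path, reference_text, transcribe,
                                      threshold=0.95):
        # Accept the clip as sample noiseless audio if its transcript matches
        # the text of the content of actual interest closely enough.
        hypothesis = transcribe(audio_path)
        similarity = SequenceMatcher(None, hypothesis, reference_text).ratio()
        return similarity >= threshold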
Specifically, the sample noise audio is audio containing only noise, used as a sample in the training process of the audio enhancement model to be trained. For example, the sample noise audio may be audio generated from recorded traffic noise, building construction noise, social life noise, or the like, or may be audio composed of randomly generated irregular audio signals.
It will be appreciated that the numbers of sample noiseless audios and sample noise audios in the audio set may each be at least one. For example, multiple sample noiseless audios and sample noise audios may be included in an audio set in order to generate enough sample mixed audio to increase the sample capacity.
The numbers of sample noiseless audios and sample noise audios in the audio set may be the same or different, which is not limited in the embodiment of the present invention. For example, the audio set may include 10 sample noise audios with different noise frequencies and 3 different sample noiseless audios; alternatively, the audio set may include 1 sample noise audio and 5 different sample noiseless audios, and so on.
Specifically, when generating the sample mixed audio, the entire sample noiseless audio may be fused with sample noise audio of the same duration to obtain the sample mixed audio; alternatively, partially continuous or non-continuous audio segments in the sample noiseless audio may be fused with the sample noise audio.
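The disclosure does not fix a particular mixing rule; the following minimal sketch assumes time-aligned, equal-length waveforms, additive fusion, and a hypothetical snr_db parameter controlling the mixing level:

    import numpy as np

    def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
        # Fuse equal-length clean and noise waveforms at a target SNR in dB.
        assert clean.shape == noise.shape, "waveforms must be time-aligned"
        clean_power = np.mean(clean ** 2) + 1e-12
        noise_power = np.mean(noise ** 2) + 1e-12
        # Scale the noise so the clean-to-noise power ratio equals snr_db.
        scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
        return clean + scale * noise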
In an actual application process, if the scenes or contents corresponding to the sample noiseless audio and/or the sample noise audio are relatively homogeneous, the audio enhancement model may not learn other types of scenes or contents well, and may fail to enhance audio that includes them. To address this, the sample noiseless audio and/or sample noise audio may be extended so that the audio set covers as many possible scenes and contents as feasible, for example, daily calls in scenes with traffic noise, meeting recordings in scenes with building construction noise, etc.
202. Noise separation is performed on the sample mixed audio through the audio enhancement model to be trained to obtain training noiseless audio and training noise audio.
The audio enhancement method provided by the embodiment of the invention relates to the Machine Learning (ML) field of artificial intelligence, and in particular to artificial neural networks in the Deep Learning field; the audio enhancement model in this embodiment can be constructed based on an artificial neural network structure.
In some optional embodiments, the audio enhancement model provided in the embodiments of the present invention may include a coding layer, a feature extraction layer, and a decoding layer, and at this time, the step "performing noise separation on the sample mixed audio through the audio enhancement model to be trained to obtain a training noiseless audio and a training noise audio" may specifically include:
mapping the frequency domain representation of the sample mixed audio to initial characteristics of the sample mixed audio through an encoding layer of an audio enhancement model to be trained;
respectively performing feature extraction on at least two sub-features of the initial features through a feature extraction layer of the audio enhancement model to be trained to obtain a spectral feature of each sub-feature, and performing inter-feature relationship extraction on the at least two sub-features to obtain spectral relationship features between the sub-features;
decoding and mapping each spectral feature and each spectral relationship feature through a decoding layer of the audio enhancement model to be trained to obtain a frequency domain representation of the training noiseless audio and a frequency domain representation of the training noise audio;
and obtaining the training noiseless audio and the training noise audio according to the frequency domain representation of the training noiseless audio and the frequency domain representation of the training noise audio.
Here, the encoding layer may map the frequency domain representation of the sample mixed audio into a mathematical representation understandable to a computer, i.e., the initial features of the sample mixed audio. For example, the coding layer may include at least one convolutional layer, such as a one-dimensional convolution (Conv-1D) layer. A Conv-1D coding layer can achieve a good coding effect because its convolution kernel is small.
As another example, the encoder may also be a two-dimensional convolution (Conv-2D) layer, which extracts local features from a frequency domain representation (e.g., a spectrogram) of the sample mixed audio and reduces the feature resolution.
In particular, the decoding layer may convert the mathematical representations of the training noiseless audio and the training noise audio into frequency domain representations. For example, the decoder may use convolutional layers transposed from the coding layer to restore the low-resolution features to their original size, forming a symmetric structure with the encoder.
A convolutional layer is a network structure capable of extracting features; for example, a convolutional network may include convolutional layers that extract features through convolution operations.
In an alternative example, the encoder may include a plurality of convolutional layers, each containing at least one convolution unit, where different convolution units may extract different features. When feature extraction is performed through the convolutional layers, the frequency domain representation of the sample mixed audio is scanned by the convolution units, and different convolution kernels learn different features, thereby extracting the features of the frequency domain representation of the sample mixed audio.
Alternatively, as shown in fig. 3, there may be skip connections between the encoder and the decoder to pass detailed information.
In the embodiment of the present invention, the feature extraction layer may divide the initial feature to obtain at least two sub-features. The feature extraction layer may perform feature extraction on at least two sub-features of the initial feature, and perform feature extraction on an association relationship between the sub-features.
Optionally, the parameters of the feature extraction layers may include the number of feature extraction layers used for extracting feature information in the audio enhancement model, the number of input channels of each feature extraction layer, and the like. For example, if the feature extraction layers of the audio enhancement model are convolutional layers, the parameters may include the number of convolutional layers, the sizes of the convolution kernels in the convolutional layers, and/or the number of input channels corresponding to each convolutional layer, and so on.
It is understood that the audio enhancement model may have only one feature extraction layer, or a plurality of feature extraction layers. When the audio enhancement model has a plurality of feature extraction layers, as the extraction operations accumulate, the spectral features and/or spectral relationship features extracted by each successive feature extraction layer contain less content information about the sub-features and more abstract feature information.
Specifically, the audio enhancement model may use the same feature extraction layer to perform feature extraction on each sub-feature and perform feature extraction on the association relationship between each sub-feature.
Alternatively, the feature extraction layer may include a first feature extraction layer and a second feature extraction layer constructed based on RNNs, where the first feature extraction layer performs feature extraction on at least two sub-features of the initial features respectively to obtain a spectral feature of each sub-feature, and the second feature extraction layer performs inter-feature relationship extraction on the at least two sub-features to obtain the spectral relationship features between the sub-features.
In some optional examples, the audio enhancement model may contain only one pair of connected first and second feature extraction layers, or several connected first and second feature extraction layers.
The features extracted from the initial features by the first feature extraction layer of the audio enhancement model may specifically be local features of the spectrogram corresponding to the initial features, such as contrast, detail definition, shadows and highlights, which reflect the local specificity of the spectrogram; they may also be global features of the spectrogram, such as line features, color features, texture features and structural features.
As shown in fig. 4, the first feature extraction layer may include a BiLSTM layer, an FC layer and an iLN layer, and may model each sub-feature based on the spectral pattern. Because of the harmonic structure of speech, modeling the frequency dependence is beneficial for speech enhancement. The first feature extraction layer constructed based on an RNN can overcome the limited receptive field of a CNN and capture long-range harmonic correlations. The BiLSTM performs intra-sub-feature modeling, which does not affect the causality of the overall system.
The second feature extraction layer may include an LSTM layer, an FC layer and an iLN layer. The temporal dependence of each frequency is modeled in the second feature extraction layer using LSTMs, which guarantees strictly causal processing. These LSTMs are computed in parallel.
The BiLSTM of the first feature extraction layer and the LSTM of the second feature extraction layer are each followed by a fully connected layer (FC) and a normalization layer (LN). A residual connection can then be applied between the input of the RNN and the output of the LN to further mitigate the gradient vanishing problem.
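A minimal sketch of these two layers follows; the tensor layouts, hidden sizes, and the use of nn.LayerNorm in place of the iLN layer are assumptions for illustration:

    import torch
    import torch.nn as nn

    class FirstFeatureExtractionLayer(nn.Module):
        # BiLSTM over the frequency axis within each frame (bidirectional only
        # along frequency, so causality in time is preserved), followed by
        # FC + LN and a residual connection, as described for fig. 4.
        def __init__(self, channels: int):
            super().__init__()
            self.rnn = nn.LSTM(channels, channels, bidirectional=True, batch_first=True)
            self.fc = nn.Linear(2 * channels, channels)
            self.norm = nn.LayerNorm(channels)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch * time, freq, channels); frequency is the sequence axis.
            y, _ = self.rnn(x)
            return x + self.norm(self.fc(y))

    class SecondFeatureExtractionLayer(nn.Module):
        # Unidirectional LSTM over the time axis for each frequency bin, which
        # keeps the processing strictly causal; the per-frequency LSTMs run as
        # one batched (parallel) call.
        def __init__(self, channels: int):
            super().__init__()
            self.rnn = nn.LSTM(channels, channels, batch_first=True)
            self.fc = nn.Linear(channels, channels)
            self.norm = nn.LayerNorm(channels)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch * freq, time, channels); time is the sequence axis.
            y, _ = self.rnn(x)
            return x + self.norm(self.fc(y))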
Optionally, as shown in fig. 4, the audio enhancement model may further include a sequence conversion layer, which converts the features output by the first feature extraction layer or the second feature extraction layer into a feature sequence.
203. The model loss corresponding to the audio enhancement model to be trained is calculated according to the sample noiseless audio, the sample noise audio, the sample mixed audio, the training noiseless audio and the training noise audio.
Specifically, step 203 may include:
calculating the noiseless audio training loss of the audio enhancement model to be trained according to the sample noiseless audio and the training noiseless audio;
calculating the noise audio training loss of the audio enhancement model to be trained according to the sample noise audio and the training noise audio;
mixing the training noiseless audio and the training noise audio to obtain training mixed audio, and calculating the mixed audio training loss of the audio enhancement model to be trained according to the training mixed audio and the sample mixed audio;
and calculating the model loss corresponding to the audio enhancement model to be trained based on the noiseless audio training loss, the noise audio training loss and the mixed audio training loss.
That is to say, in the embodiment of the present invention, a loss is calculated between the training noiseless audio output by the audio enhancement model and the real sample noiseless audio; a loss is calculated between the training noise audio output by the model and the real sample noise audio; and, after the training noiseless audio output by the model is mixed with the training noise audio, a loss is calculated between the resulting training mixed audio and the real sample mixed audio.
In some alternative embodiments, the model loss may be obtained by directly adding the noiseless audio training loss, the noise audio training loss and the mixed audio training loss.
In other alternative embodiments, in order to enhance the robustness of the model, different calculation weights may be set for the noiseless audio training loss, the noise audio training loss and the mixed audio training loss. In this case, the step of calculating the model loss corresponding to the audio enhancement model to be trained based on the noiseless audio training loss, the noise audio training loss and the mixed audio training loss may specifically include:
acquiring a first loss weight corresponding to the noiseless audio training loss, a second loss weight corresponding to the noise audio training loss and a third loss weight corresponding to the mixed audio training loss;
and performing weighted calculation on the noiseless audio training loss, the noise audio training loss and the mixed audio training loss according to the first loss weight, the second loss weight and the third loss weight to obtain the model loss corresponding to the audio enhancement model to be trained.
Optionally, in order to reduce the loss value of the trained audio enhancement model on noiseless audio, the first loss weight may be set to be greater than the second loss weight and the third loss weight. For example, the first loss weight may be set to 0.5, the second loss weight to 0.3, and the third loss weight to 0.2.
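To illustrate the weighted combination above, a minimal sketch follows; using an L1 waveform distance for each term and additive remixing of the model outputs are assumptions, since the disclosure does not name the per-term distance:

    import torch
    import torch.nn.functional as F

    def model_loss(sample_clean, sample_noise, sample_mix,
                   train_clean, train_noise,
                   w_clean=0.5, w_noise=0.3, w_mix=0.2):
        # Remix the two model outputs; additive mixing is assumed here.
        train_mix = train_clean + train_noise
        loss_clean = F.l1_loss(train_clean, sample_clean)  # noiseless audio training loss
        loss_noise = F.l1_loss(train_noise, sample_noise)  # noise audio training loss
        loss_mix = F.l1_loss(train_mix, sample_mix)        # mixed audio training loss
        return w_clean * loss_clean + w_noise * loss_noise + w_mix * loss_mix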
Alternatively, the first loss weight and the second loss weight may be determined according to the proportion of the duration of the audio segments fused with the sample noise audio to the total duration of the sample noiseless audio.
204. The audio enhancement model to be trained is adjusted based on the model loss to obtain the trained audio enhancement model, so that audio enhancement can be performed on the audio to be enhanced through the trained audio enhancement model.
The parameters and the like of the audio enhancement model can be adjusted through the pre-training process of the audio enhancement model, so that the audio enhancement model can achieve better audio enhancement performance.
For example, in the embodiment of the present invention, considering the redundancy of the model, and in order to accelerate its inference speed and reduce its memory footprint, experimental comparison showed that the number of cycles of the feature extraction layer in the audio enhancement model can be reduced to one. As a result, single-core CPU occupancy drops from 36% to 30%, and single-frame inference time drops from 2.497 ms to 1.380 ms.
For another example, the coding layer of the adjusted audio enhancement model may be composed of 5 convolutional layers, whose convolution kernels are [32, 32, 32, 64, 128], respectively; the decoding layer is composed of 5 deconvolution layers, whose convolution kernels are [64, 32, 32, 32, 16], respectively.
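A construction sketch of such a stack follows, interpreting the bracketed numbers as per-layer output channel counts; the kernel sizes, strides, and 2-channel (real/imaginary) input are assumptions for illustration:

    import torch.nn as nn

    ENC_CHANNELS = [32, 32, 32, 64, 128]
    DEC_CHANNELS = [64, 32, 32, 32, 16]

    def build_encoder(in_channels: int = 2) -> nn.Sequential:
        layers, prev = [], in_channels
        for ch in ENC_CHANNELS:
            # Halve the frequency axis at each stage, keep the time axis.
            layers += [nn.Conv2d(prev, ch, kernel_size=(1, 3),
                                 stride=(1, 2), padding=(0, 1)), nn.PReLU()]
            prev = ch
        return nn.Sequential(*layers)

    def build_decoder() -> nn.Sequential:
        layers, prev = [], ENC_CHANNELS[-1]
        for ch in DEC_CHANNELS:
            # Mirror of the encoder: restore the frequency resolution.
            layers += [nn.ConvTranspose2d(prev, ch, kernel_size=(1, 3),
                                          stride=(1, 2), padding=(0, 1),
                                          output_padding=(0, 1)), nn.PReLU()]
            prev = ch
        return nn.Sequential(*layers)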
In some optional embodiments, after the training of the audio enhancement model to be trained is completed, the audio enhancement model after training may be used to process the audio that needs to be enhanced, that is, the audio enhancement method provided in the embodiments of the present invention may further include:
acquiring audio to be enhanced;
and performing audio enhancement on the audio to be enhanced by adopting the trained audio enhancement model to obtain enhanced audio corresponding to the audio to be enhanced.
Alternatively, obtaining the audio to be enhanced may mean obtaining an undivided frequency domain representation of the audio to be enhanced. Alternatively, to increase the audio enhancement speed, the frequency domain representation of the audio to be enhanced may be divided into multiple sub-band signals in a multi-band manner. That is, the step of acquiring the audio to be enhanced may specifically include:
acquiring a time domain representation of the audio to be enhanced, and converting the time domain representation into a preset number of frequency domain representations, where the preset number is at least 2;
and determining real part features and imaginary part features corresponding to the frequency domain representations of the audio to be enhanced, and fusing the real part features and the imaginary part features to obtain the features to be processed corresponding to the audio to be enhanced.
Specifically, the conversion of the time domain representation into frequency domain representations may be implemented by the short-time Fourier transform (STFT). In the embodiment of the present invention, the audio to be enhanced may be split into N subbands by an analysis filter, and an STFT may be applied to each subband.
For example, the audio to be enhanced in its time domain representation may be sliced by an analysis filter from a one-dimensional vector into a four-dimensional subband vector; a short-time Fourier transform is then performed on each dimension of the four-dimensional vector to obtain the real part features and imaginary part features of each one-dimensional vector after the transform.
The fusion of the real part features and the imaginary part features may consist of concatenating them to obtain the features to be processed, or of regularizing the concatenated result to obtain the features to be processed.
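A sketch of the per-subband STFT and real/imaginary fusion follows; the FFT size and hop length are assumed values, and the subbands are assumed to arrive from the analysis filter as a (batch, bands, samples) tensor:

    import torch

    def subbands_to_features(subbands: torch.Tensor, n_fft: int = 512,
                             hop: int = 128) -> torch.Tensor:
        # subbands: (batch, n_bands, samples). Returns the features to be
        # processed: real and imaginary parts of each subband STFT,
        # concatenated (fused) along the channel axis.
        b, n_bands, t = subbands.shape
        spec = torch.stft(subbands.reshape(b * n_bands, t), n_fft=n_fft,
                          hop_length=hop, window=torch.hann_window(n_fft),
                          return_complex=True)
        spec = spec.reshape(b, n_bands, spec.shape[-2], spec.shape[-1])
        return torch.cat([spec.real, spec.imag], dim=1)  # (b, 2*n_bands, freq, frames)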
The multiband algorithm can exploit the sparsity of a neural network model to predict all subband signals with a single shared feature extraction layer. More specifically, the shared feature extraction layer takes all previously predicted sub-band sample steps as input and predicts the next sample in all sub-bands in one inference step, as shown in fig. 5. With a shared feature extraction layer structure, the audio in each sub-band can be downsampled by a factor of N (the number of bands), so the total computational cost can be reduced.
Correspondingly, the step of performing audio enhancement on the audio to be enhanced by using the trained audio enhancement model to obtain the enhanced audio corresponding to the audio to be enhanced may specifically include:
performing noise separation processing on the features to be processed by using the trained audio enhancement model to obtain enhanced real part features and enhanced imaginary part features of the enhanced audio corresponding to the audio to be enhanced, where the numbers of enhanced real part features and enhanced imaginary part features are equal to the preset number;
and performing time domain conversion based on the enhanced real part characteristic and the enhanced imaginary part characteristic to obtain an enhanced audio corresponding to the audio to be enhanced.
For example, in the case of slicing into four-dimensional subband vectors, the trained audio enhancement model may perform noise separation processing on the features to be processed to obtain a 4-dimensional real-part subband mask (the enhanced real part features) and a 4-dimensional imaginary-part subband mask (the enhanced imaginary part features) of the enhanced audio.
Specifically, the time domain conversion may be performed on the enhanced real part features and the enhanced imaginary part features through the inverse short-time Fourier transform (iSTFT), and the step of performing time domain conversion based on the enhanced real part features and the enhanced imaginary part features to obtain the enhanced audio corresponding to the audio to be enhanced may include:
performing time domain conversion based on the enhanced real part features and the enhanced imaginary part features to obtain a preset number of enhanced sub-audios;
and synthesizing each enhanced sub-audio through a preset synthesis filter to obtain an enhanced audio corresponding to the audio to be enhanced.
For example, after the enhanced real part features and the enhanced imaginary part features are obtained, they may be passed through a mask layer and then subjected to an inverse short-time Fourier transform to obtain 4-dimensional enhanced sub-audio. The 4-dimensional enhanced sub-audio is then processed by a synthesis filter to obtain one-dimensional audio, which is the enhanced audio.
Here, the predicted audio signal in each frequency band is first upsampled and then passed to the synthesis filter. The sum of the frequency bands of the signal after the synthesis filter is a single audio signal. Upsampling is accomplished by filling in zeros between the original samples. With the synthesis filter, single-core CPU occupancy is reduced from 30% to 17%, and single-frame inference time from 1.380 ms to 0.851 ms.
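A sketch of the zero-insertion upsampling and synthesis-filter summation described above, assuming per-band FIR synthesis filters are given as an array:

    import numpy as np

    def synthesize(subband_audio: np.ndarray, synthesis_filters: np.ndarray) -> np.ndarray:
        # subband_audio: (n_bands, n_samples) enhanced sub-audios at the
        # decimated rate; synthesis_filters: (n_bands, taps) FIR filters.
        n_bands, n = subband_audio.shape
        out = np.zeros(n * n_bands)
        for k in range(n_bands):
            up = np.zeros(n * n_bands)
            up[::n_bands] = subband_audio[k]   # upsample by filling in zeros
            out += np.convolve(up, synthesis_filters[k], mode="same")
        return out                             # sum of the bands: one signal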
It is understood that the multiband algorithm can also be applied in the pre-training process of the model. Taking sample noiseless audio and sample mixed audio as four-dimensional subband signals as an example, the output of the audio enhancement model is a 16-dimensional vector representing, respectively, the 4-dimensional real-part subband mask and 4-dimensional imaginary-part subband mask of the training noiseless audio, and the 4-dimensional real-part subband mask and 4-dimensional imaginary-part subband mask of the training noise audio.
Correspondingly, after the enhanced real part features and enhanced imaginary part features of the training noiseless audio and the training noise audio pass through a mask layer and an inverse Fourier transform, the 4-dimensional enhanced sub-audio of the training noiseless audio and the 4-dimensional enhanced sub-audio of the training noise audio can be obtained.
When the loss is calculated, losses are computed between the 4-dimensional enhanced sub-audio of the training noiseless audio output by the audio enhancement model and the real 4-dimensional sample noiseless audio; between the 4-dimensional enhanced sub-audio of the training noise audio output by the model and the real 4-dimensional sample noise audio; and, after the 4-dimensional enhanced sub-audio of the training noiseless audio output by the model is mixed with the 4-dimensional enhanced sub-audio of the training noise audio, between the mixture and the real sample mixed audio, giving 12 loss values in total.
As can be seen from the above, in the embodiment of the present invention, an audio set comprising sample noiseless audio and sample noise audio can be obtained, sample mixed audio is generated based on the sample noiseless audio and the sample noise audio, noise separation is performed on the sample mixed audio through the audio enhancement model to be trained to obtain training noiseless audio and training noise audio, the model loss corresponding to the audio enhancement model to be trained is calculated according to the sample noiseless audio, the sample noise audio, the sample mixed audio, the training noiseless audio and the training noise audio, and the audio enhancement model to be trained is adjusted based on the model loss to obtain the trained audio enhancement model, which then performs audio enhancement on the audio to be enhanced. When the model loss is calculated, the sample noiseless audio, the sample noise audio, the sample mixed audio, the training noiseless audio and the training noise audio are all taken into account. Compared with calculating the loss only from the sample noiseless audio and the training noiseless audio, richer information participates in the loss calculation, and the resulting model loss supports more accurate adjustment of the model. The training accuracy of the audio enhancement model can therefore be improved, its audio enhancement effect can be improved, and higher audio clarity and perceptual quality can be obtained after audio enhancement processing is performed on audio.
The method according to the preceding embodiment is illustrated in further detail below by way of example.
In this embodiment, a description will be given in conjunction with the system of fig. 1.
As shown in fig. 6, the specific flow of the audio enhancement method of the present embodiment may be as follows:
601. An audio set comprising sample noiseless audio and sample noise audio is obtained, and time-domain sample mixed audio is generated based on the sample noiseless audio and the sample noise audio.
602. The time-domain sample mixed audio is sliced from a one-dimensional vector into an N-dimensional subband vector by an analysis filter.
Normally, when audio enhancement is carried out in real time, only one frame of audio is predicted at a time. After the multiband algorithm is added, the model can take four frames of audio as input at once, without increasing the computation of the model, and then output four frames of audio at once. If the model is compared to a black box: with frame-by-frame prediction, one frame is predicted at a time, so if predicting one frame takes 1 s, predicting 10 frames takes 10 s; similarly, after the multiband algorithm is added, predicting 4 frames would nominally take only one second, but in practice, due to the overhead of other computations, predicting 4 frames takes about 1.6 s, so the audio enhancement speed of the model is accelerated by about 2.5 times.
603. A short-time Fourier transform is performed on each dimension of the N-dimensional subband vector to obtain the real part features and imaginary part features of each dimension after the transform, and the real part features and imaginary part features of the N-dimensional subband vector are fused to obtain the features to be processed corresponding to the audio to be enhanced.
604. Noise separation is performed on the features to be processed through the audio enhancement model to be trained to obtain the enhanced real part features and enhanced imaginary part features of the training noiseless audio corresponding to the sample mixed audio, and the enhanced real part features and enhanced imaginary part features of the training noise audio.
605. Time domain conversion is performed based on the enhanced real part features and the enhanced imaginary part features to obtain N enhanced sub-audios of the training noiseless audio and N enhanced sub-audios of the training noise audio, and each set of enhanced sub-audios is synthesized through a preset synthesis filter to obtain the training noiseless audio and the training noise audio corresponding to the sample mixed audio.
For example, the analysis and synthesis filters may be as shown in fig. 7. A stable yet efficient low-cost filter bank called a pseudo-quadrature mirror filter bank (pseudo-QMF) can be employed for the multi-band processing. The pseudo-QMF is a cosine modulated filter bank (CMFB) in which all filters are cosine-modulated versions of a low-pass prototype filter. The prototype filter is designed to have linear phase, leading to an analysis/synthesis system without phase distortion. Due to the aliasing cancellation constraints of the desired filter bank, the output aliasing is at the stop-band attenuation level.
Thus, the attenuation characteristics of the high-band analysis and synthesis filters can be on the order of -100 dB. Considering the computational efficiency of the multi-band processing, a finite impulse response (FIR) analysis/synthesis filter of order 63 may be selected for a uniformly spaced 4-band implementation.
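A construction sketch of the cosine-modulated pseudo-QMF bank for the 4-band, order-63 FIR case follows; the prototype cutoff passed to firwin is a common design choice, not a value stated in this disclosure:

    import numpy as np
    from scipy.signal import firwin

    def pseudo_qmf_bank(n_bands: int = 4, order: int = 63):
        # One linear-phase low-pass prototype, cosine-modulated into
        # analysis and synthesis filters for n_bands uniform bands.
        taps = order + 1
        proto = firwin(taps, 1.0 / (2 * n_bands))  # cutoff in Nyquist units
        n = np.arange(taps)
        analysis = np.zeros((n_bands, taps))
        synthesis = np.zeros((n_bands, taps))
        for k in range(n_bands):
            phase = (-1) ** k * np.pi / 4
            arg = (2 * k + 1) * np.pi / (2 * n_bands) * (n - order / 2)
            analysis[k] = 2 * proto * np.cos(arg + phase)
            synthesis[k] = 2 * proto * np.cos(arg - phase)
        return analysis, synthesis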
606. The model loss corresponding to the audio enhancement model to be trained is calculated according to the sample noiseless audio, the sample noise audio, the sample mixed audio, the training noiseless audio and the training noise audio.
For example, taking N = 4, when the loss is calculated, losses are computed between the 4-dimensional enhanced sub-audio of the training noiseless audio output by the audio enhancement model and the real 4-dimensional sample noiseless audio; between the 4-dimensional enhanced sub-audio of the training noise audio output by the model and the real 4-dimensional sample noise audio; and, after the 4-dimensional enhanced sub-audio of the training noiseless audio output by the model is mixed with the 4-dimensional enhanced sub-audio of the training noise audio, between the mixture and the real sample mixed audio, giving 12 loss values in total.
In order to increase the robustness of the model and reduce the loss value on noiseless audio, the losses corresponding to the 4-dimensional sample noiseless audio are weighted at 50%, those corresponding to the 4-dimensional sample noise audio at 30%, and those corresponding to the 4-dimensional sample mixed audio at 20%.
607. The audio enhancement model to be trained is adjusted based on the model loss to obtain the trained audio enhancement model.
For example, the coding layer of the trained audio enhancement model may consist of 5 convolutional layers with convolution kernels [32, 32, 32, 64, 128]. The core layer is the feature extraction layer, which consists of a first feature extraction layer and a second feature extraction layer: the first is a bidirectional RNN layer followed by a dense layer and a regularization layer, and the second is a unidirectional RNN layer likewise followed by a dense layer and a regularization layer. The decoding layer of the model consists of 5 deconvolution layers with convolution kernels [64, 32, 32, 32, 16].
608. The audio to be enhanced is obtained, and audio enhancement is performed on it using the trained audio enhancement model to obtain the enhanced audio corresponding to the audio to be enhanced.
In the practical application process, the trained audio enhancement model outputs only the N-dimensional enhanced sub-audio; passing the N-dimensional enhanced sub-audio through the synthesis filter yields one-dimensional audio, which is the desired enhanced audio.
As can be seen from the above, in the embodiment of the present invention, an audio set comprising sample noiseless audio and sample noise audio can be obtained, sample mixed audio is generated based on the sample noiseless audio and the sample noise audio, noise separation is performed on the sample mixed audio through the audio enhancement model to be trained to obtain training noiseless audio and training noise audio, the model loss corresponding to the audio enhancement model to be trained is calculated according to the sample noiseless audio, the sample noise audio, the sample mixed audio, the training noiseless audio and the training noise audio, and the audio enhancement model to be trained is adjusted based on the model loss to obtain the trained audio enhancement model, which then performs audio enhancement on the audio to be enhanced. When the model loss is calculated, the sample noiseless audio, the sample noise audio, the sample mixed audio, the training noiseless audio and the training noise audio are all taken into account. Compared with calculating the loss only from the sample noiseless audio and the training noiseless audio, richer information participates in the loss calculation, and the resulting model loss supports more accurate adjustment of the model. The training accuracy of the audio enhancement model can therefore be improved, its audio enhancement effect can be improved, and higher audio clarity and perceptual quality can be obtained after audio enhancement processing is performed on audio.
In order to better implement the above method, correspondingly, the embodiment of the invention also provides an audio enhancement device.
Referring to fig. 8, the apparatus includes:
a sample obtaining unit 801, configured to obtain an audio set including a sample noiseless audio and a sample noise audio, and generate a sample mixed audio based on the sample noiseless audio and the sample noise audio;
the audio enhancement unit 802 may be configured to perform noise separation on the sample mixed audio through an audio enhancement model to be trained, so as to obtain a training noiseless audio and a training noise audio;
a loss calculating unit 803, configured to calculate a model loss corresponding to the audio enhancement model to be trained according to the sample noiseless audio, the sample noise audio, the sample mixed audio, the training noiseless audio, and the training noise audio;
the model adjusting unit 804 may be configured to adjust the audio enhancement model to be trained based on the model loss to obtain the trained audio enhancement model, so as to perform audio enhancement on the audio to be enhanced through the trained audio enhancement model.
In some optional embodiments, the loss calculating unit 803 may be configured to calculate the noiseless audio training loss of the audio enhancement model to be trained according to the sample noiseless audio and the training noiseless audio;
calculate the noise audio training loss of the audio enhancement model to be trained according to the sample noise audio and the training noise audio;
mix the training noiseless audio and the training noise audio to obtain training mixed audio, and calculate the mixed audio training loss of the audio enhancement model to be trained according to the training mixed audio and the sample mixed audio;
and calculate the model loss corresponding to the audio enhancement model to be trained based on the noiseless audio training loss, the noise audio training loss and the mixed audio training loss.
In some optional embodiments, the loss calculating unit 803 may be configured to obtain a first loss weight corresponding to the noiseless audio training loss, a second loss weight corresponding to the noise audio training loss, and a third loss weight corresponding to the mixed audio training loss;
and perform weighted calculation on the noiseless audio training loss, the noise audio training loss and the mixed audio training loss according to the first loss weight, the second loss weight and the third loss weight, so as to obtain the model loss corresponding to the audio enhancement model to be trained.
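In other words, the model loss can be written as a weighted sum L = w1·L_clean + w2·L_noise + w3·L_mix. The sketch below is one hedged instantiation: the per-term distance (L1 here), the mixing-by-addition of the training audios, and the example weight values are assumptions, since the embodiment does not fix them:

```python
# Hedged instantiation of the composite model loss; the L1 distance, the
# additive mixing of the training audios, and the weights are assumptions.
import torch.nn.functional as F

def model_loss(sample_clean, sample_noise, sample_mixed,
               train_clean, train_noise,
               w1=1.0, w2=0.5, w3=0.5):
    loss_clean = F.l1_loss(train_clean, sample_clean)  # noiseless audio training loss
    loss_noise = F.l1_loss(train_noise, sample_noise)  # noise audio training loss
    train_mixed = train_clean + train_noise            # training mixed audio (assumed additive)
    loss_mixed = F.l1_loss(train_mixed, sample_mixed)  # mixed audio training loss
    return w1 * loss_clean + w2 * loss_noise + w3 * loss_mixed
```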
In some optional embodiments, the audio enhancement unit 802 may be configured to map the frequency domain representation of the sample mixed audio to initial features of the sample mixed audio through an encoding layer of an audio enhancement model to be trained;
respectively extracting features of at least two sub-features of the initial features through a feature extraction layer of the audio enhancement model to be trained to obtain the spectral features of the sub-features, and extracting the relationship between the features of the at least two sub-features to obtain the spectral relation features between the sub-features;
decoding and mapping each spectral feature and each spectral relation feature through the decoding layer of the audio enhancement model to be trained to obtain the frequency domain representation of the training noiseless audio and the frequency domain representation of the training noise audio;
and obtaining the training noiseless audio and the training noise audio according to the frequency domain representation of the training noiseless audio and the frequency domain representation of the training noise audio.
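One plausible reading of this sub-feature processing is a dual-path scheme: the initial features are split into at least two sub-features (chunks), a first RNN models each sub-feature to produce its spectral features, and a second RNN runs across sub-features to capture the spectral relation features between them. The chunk layout, RNN types and dimensions in the sketch below are illustrative assumptions:

```python
# Hedged dual-path reading of the sub-feature processing; the chunking
# strategy and all dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class SubFeatureExtractor(nn.Module):
    def __init__(self, d: int = 128):
        super().__init__()
        self.intra_rnn = nn.GRU(d, d, bidirectional=True, batch_first=True)
        self.intra_dense = nn.Sequential(nn.Linear(2 * d, d), nn.LayerNorm(d))
        self.inter_rnn = nn.GRU(d, d, batch_first=True)
        self.inter_dense = nn.Sequential(nn.Linear(d, d), nn.LayerNorm(d))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_sub, sub_len, d), with n_sub >= 2 sub-features.
        b, n, l, d = x.shape
        # Spectral features of each sub-feature (per-chunk modelling).
        h, _ = self.intra_rnn(x.reshape(b * n, l, d))
        h = self.intra_dense(h).reshape(b, n, l, d)
        # Spectral relation features between sub-features (cross-chunk modelling).
        r = h.transpose(1, 2).reshape(b * l, n, d)
        r, _ = self.inter_rnn(r)
        r = self.inter_dense(r).reshape(b, l, n, d).transpose(1, 2)
        return r + x  # combine both feature types for the decoding layer
```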
In some optional embodiments, as shown in fig. 9, in the audio enhancement apparatus provided in the embodiment of the present invention, the audio enhancement unit 802 may include an audio obtaining subunit 805 and an audio enhancement subunit 806, and the audio obtaining subunit 805 may be configured to obtain the audio to be enhanced;
the audio enhancement subunit 806 may be configured to perform audio enhancement on the audio to be enhanced by using the trained audio enhancement model, so as to obtain the enhanced audio corresponding to the audio to be enhanced.
In some optional embodiments, the audio obtaining subunit may be configured to obtain a time-domain representation of the audio to be enhanced, and convert the time-domain representation into a preset number of frequency-domain representations, where the preset number is at least 2;
determining real part characteristics and imaginary part characteristics corresponding to each frequency domain representation from each frequency domain representation of the audio to be enhanced, and fusing the real part characteristics and the imaginary part characteristics to obtain the characteristics to be processed corresponding to the audio to be enhanced;
the audio enhancement subunit 806 may be configured to perform noise separation processing on the features to be processed by using the trained audio enhancement model to obtain the enhanced real part features and the enhanced imaginary part features of the enhanced audio corresponding to the audio to be enhanced, where the number of the enhanced real part features and the enhanced imaginary part features is the same as the preset number;
and performing time domain conversion based on the enhanced real part characteristic and the enhanced imaginary part characteristic to obtain an enhanced audio corresponding to the audio to be enhanced.
In some optional embodiments, the audio enhancement subunit 806 may be configured to perform time-domain conversion based on the enhanced real part features and the enhanced imaginary part features to obtain the same number of enhanced sub-audios as the preset number;
and synthesizing each enhanced sub-audio through a preset synthesis filter to obtain an enhanced audio corresponding to the audio to be enhanced.
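Putting the pieces above together, the following hedged sketch shows one way the inference path could look, reading the preset number of frequency-domain representations as per-subband STFTs whose real and imaginary parts are stacked into the features to be processed; the strided subband split (a stand-in for a real analysis filterbank) and the STFT parameters are assumptions:

```python
# Hedged sketch of the inference path: subband split, per-band STFT,
# real/imaginary fusion, and inverse transform. The strided split stands in
# for a real analysis filterbank; STFT parameters are assumptions.
import torch

def prepare_features(audio: torch.Tensor, n_bands: int = 4, n_fft: int = 256):
    # audio: (batch, samples), samples assumed divisible by n_bands.
    window = torch.hann_window(n_fft)
    specs = [torch.stft(audio[:, k::n_bands], n_fft=n_fft, window=window,
                        return_complex=True) for k in range(n_bands)]
    # Fuse real and imaginary parts of each representation into one tensor.
    feats = torch.stack([torch.cat([s.real, s.imag], dim=1) for s in specs], dim=1)
    return feats  # (batch, n_bands, 2 * freq, frames): features to be processed

def to_sub_audios(enh_real, enh_imag, n_fft: int = 256):
    # enh_real / enh_imag: per-band enhanced features from the trained model.
    window = torch.hann_window(n_fft)
    return [torch.istft(torch.complex(r, i), n_fft=n_fft, window=window)
            for r, i in zip(enh_real, enh_imag)]  # enhanced sub-audios
```

The enhanced sub-audios returned here would then be merged by the preset synthesis filter, as in the synthesis sketch shown earlier.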
From the above, with the audio enhancement apparatus, an audio set including sample noiseless audio and sample noise audio can be obtained, sample mixed audio is generated based on the sample noiseless audio and the sample noise audio, noise separation is performed on the sample mixed audio through the audio enhancement model to be trained to obtain training noiseless audio and training noise audio, the model loss corresponding to the audio enhancement model to be trained is calculated according to the sample noiseless audio, the sample noise audio, the sample mixed audio, the training noiseless audio and the training noise audio, and the audio enhancement model to be trained is adjusted based on the model loss to obtain the trained audio enhancement model, which then performs audio enhancement on the audio to be enhanced. When the model loss is calculated, the sample noiseless audio, the sample noise audio, the sample mixed audio, the training noiseless audio and the training noise audio are all taken into account; compared with calculating the loss only from the sample noiseless audio and the training noiseless audio, more information participates in the loss calculation, so the resulting model loss allows the model to be adjusted more accurately. The training precision of the audio enhancement model can therefore be improved, the audio enhancement effect of the model is improved, and higher audio clarity and perceptual quality can be obtained after the audio is enhanced.
In addition, correspondingly, an embodiment of the present application further provides a computer device, which may be a terminal. As shown in fig. 10, fig. 10 is a schematic structural diagram of a computer device according to an embodiment of the present application. The computer device 1000 includes a processor 1001 with one or more processing cores, a memory 1002 with one or more computer-readable storage media, and a computer program stored in the memory 1002 and executable on the processor. The processor 1001 is electrically connected to the memory 1002. Those skilled in the art will appreciate that the computer device configuration illustrated in the figure does not limit the computer device, which may include more or fewer components than illustrated, combine certain components, or adopt a different arrangement of components.
The processor 1001 is the control center of the computer device 1000. It connects the various parts of the entire computer device 1000 using various interfaces and lines, and performs the various functions of the computer device 1000 and processes its data by running or loading the software programs and/or modules stored in the memory 1002 and calling the data stored in the memory 1002, thereby monitoring the computer device 1000 as a whole.
In this embodiment of the application, the processor 1001 in the computer device 1000 loads instructions corresponding to processes of one or more applications into the memory 1002, and the processor 1001 runs the applications stored in the memory 1002 according to the following steps, so as to implement various functions:
acquiring an audio set comprising a sample noiseless audio and a sample noise audio, and generating a sample mixed audio based on the sample noiseless audio and the sample noise audio;
carrying out noise separation on the sample mixed audio through an audio enhancement model to be trained to obtain training noiseless audio and training noise audio;
calculating model loss corresponding to the audio enhancement model to be trained according to the sample noiseless audio, the sample noise audio, the sample mixed audio, the training noiseless audio and the training noise audio;
and adjusting the audio enhancement model to be trained based on the model loss to obtain the trained audio enhancement model, so as to perform audio enhancement on the audio to be enhanced through the trained audio enhancement model.
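Taken together, these four steps amount to one training iteration. The sketch below assumes the mixing is additive with a random noise gain and reuses the model_loss helper sketched earlier; the optimizer choice and the exact input domain of the model are likewise assumptions:

```python
# One hypothetical training iteration over the four steps above; the additive
# mixing at a random gain and the optimizer choice are assumptions, and
# model_loss refers to the composite-loss sketch given earlier.
import torch

def train_step(model, optimizer, sample_clean, sample_noise):
    # 1. Generate the sample mixed audio from the noiseless and noise samples.
    sample_mixed = sample_clean + torch.rand(1).item() * sample_noise
    # 2. Noise separation by the audio enhancement model to be trained.
    train_clean, train_noise = model(sample_mixed)
    # 3. Model loss from all five audio quantities.
    loss = model_loss(sample_clean, sample_noise, sample_mixed,
                      train_clean, train_noise)
    # 4. Adjust the model to be trained based on the model loss.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```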
In some optional embodiments, calculating a model loss corresponding to the audio enhancement model to be trained according to the sample noiseless audio, the sample noisy audio, the sample mixed audio, the training noiseless audio, and the training noisy audio includes:
calculating the noiseless audio training loss of the audio enhancement model to be trained according to the sample noiseless audio and the training noiseless audio;
calculating the noise audio training loss of the audio enhancement model to be trained according to the sample noise audio and the training noise audio;
mixing the training noiseless audio and the training noise audio to obtain training mixed audio, and calculating the mixed audio training loss of the audio enhancement model to be trained according to the training mixed audio and the sample mixed audio;
and calculating model loss corresponding to the audio enhancement model to be trained based on the noiseless audio training loss, the noisy audio training loss and the mixed audio training loss.
In some optional embodiments, calculating a model loss corresponding to the audio enhancement model to be trained based on the noiseless audio training loss, the noisy audio training loss, and the mixed audio training loss includes:
acquiring a first loss weight corresponding to noise-free audio training loss, a second loss weight corresponding to noise audio training loss and a third loss weight corresponding to mixed audio training loss;
and according to the first loss weight, the second loss weight and the third loss weight, carrying out weighted calculation on the noise-free audio training loss, the noise audio training loss and the mixed audio training loss to obtain the model loss corresponding to the audio enhancement model to be trained.
In some optional embodiments, performing noise separation on the sample mixed audio through an audio enhancement model to be trained to obtain training noiseless audio and training noise audio includes:
mapping the frequency domain representation of the sample mixed audio to initial characteristics of the sample mixed audio through a coding layer of an audio enhancement model to be trained;
respectively extracting features of at least two sub-features of the initial features through a feature extraction layer of the audio enhancement model to be trained to obtain the spectral features of the sub-features, and extracting the relationship between the features of the at least two sub-features to obtain the spectral relation features between the sub-features;
decoding and mapping each spectral feature and each spectral relation feature through the decoding layer of the audio enhancement model to be trained to obtain the frequency domain representation of the training noiseless audio and the frequency domain representation of the training noise audio;
and obtaining the training noiseless audio and the training noise audio according to the frequency domain representation of the training noiseless audio and the frequency domain representation of the training noise audio.
In some optional embodiments, the audio enhancement method provided in the embodiments of the present invention further includes:
acquiring audio to be enhanced;
and performing audio enhancement on the audio to be enhanced by adopting the trained audio enhancement model to obtain an enhanced audio corresponding to the audio to be enhanced.
In some optional embodiments, obtaining the audio to be enhanced includes:
acquiring time domain representation of audio to be enhanced, and converting the time domain representation into frequency domain representations with preset number, wherein the preset number is at least 2;
determining real part characteristics and imaginary part characteristics corresponding to the frequency domain representations from the frequency domain representations of the audio to be enhanced, and fusing the real part characteristics and the imaginary part characteristics to obtain the characteristics to be processed corresponding to the audio to be enhanced;
performing audio enhancement on the audio to be enhanced by using the trained audio enhancement model to obtain the enhanced audio corresponding to the audio to be enhanced includes:
performing noise separation processing on the features to be processed by adopting the trained audio enhancement model to obtain enhanced real part features and enhanced imaginary part features of the enhanced audio corresponding to the audio to be enhanced, wherein the number of the enhanced real part features and the enhanced imaginary part features is the same as the preset number;
and performing time domain conversion based on the enhanced real part characteristic and the enhanced imaginary part characteristic to obtain an enhanced audio corresponding to the audio to be enhanced.
In some optional embodiments, performing time domain conversion based on the enhanced real part feature and the enhanced imaginary part feature to obtain an enhanced audio corresponding to the audio to be enhanced, includes:
performing time domain conversion based on the enhanced real part features and the enhanced imaginary part features to obtain enhanced sub-audio with the same quantity as the preset quantity;
and synthesizing each enhanced sub-audio through a preset synthesis filter to obtain an enhanced audio corresponding to the audio to be enhanced.
According to the scheme, an audio set comprising sample noiseless audio and sample noise audio can be obtained, sample mixed audio is generated based on the sample noiseless audio and the sample noise audio, noise separation is performed on the sample mixed audio through the audio enhancement model to be trained to obtain training noiseless audio and training noise audio, the model loss corresponding to the audio enhancement model to be trained is calculated according to the sample noiseless audio, the sample noise audio, the sample mixed audio, the training noiseless audio and the training noise audio, the audio enhancement model to be trained is adjusted based on the model loss to obtain the trained audio enhancement model, and audio enhancement is performed on the audio to be enhanced through the trained audio enhancement model. In this way, the training precision of the audio enhancement model can be improved, the audio enhancement effect of the model can be improved, and higher audio clarity and perceptual quality can be obtained after the audio is enhanced.
The above operations can be implemented as described in the foregoing embodiments and are not described in detail herein.
Optionally, as shown in fig. 10, the computer device 1000 further includes: a touch display screen 1003, a radio frequency circuit 1004, an audio circuit 1005, an input unit 1006 and a power supply 1007. The processor 1001 is electrically connected to the touch display screen 1003, the radio frequency circuit 1004, the audio circuit 1005, the input unit 1006 and the power supply 1007, respectively. Those skilled in the art will appreciate that the computer device configuration illustrated in fig. 10 does not limit the computer device, which may include more or fewer components than shown, combine some of the components, or adopt a different arrangement of components.
The touch display screen 1003 may be used to display a graphical user interface and to receive operation instructions generated by a user acting on the graphical user interface. The touch display screen 1003 may include a display panel and a touch panel. The display panel may be used to display information entered by or provided to the user, as well as the various graphical user interfaces of the computer device, which may be composed of graphics, text, icons, video, and any combination thereof. Optionally, the display panel may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED) display, or the like. The touch panel may be used to collect touch operations performed by the user on or near it (for example, operations performed on or near the touch panel with a finger, a stylus, or any other suitable object or accessory) and to generate corresponding operation instructions that trigger the corresponding programs. Optionally, the touch panel may include two parts: a touch detection device and a touch controller. The touch detection device detects the position touched by the user, detects the signal produced by the touch operation, and passes the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into touch point coordinates, sends the coordinates to the processor 1001, and can receive and execute commands sent by the processor 1001. The touch panel may cover the display panel; when the touch panel detects a touch operation on or near it, it passes the operation to the processor 1001 to determine the type of the touch event, and the processor 1001 then provides a corresponding visual output on the display panel according to the type of the touch event. In the embodiment of the present application, the touch panel and the display panel may be integrated into the touch display screen 1003 to implement the input and output functions. However, in some embodiments, the touch panel and the display panel may be implemented as two separate components to perform the input and output functions. That is, the touch display screen 1003 may also serve as part of the input unit 1006 to implement an input function.
The radio frequency circuit 1004 may be used to transmit and receive radio frequency signals, so as to establish wireless communication with a network device or another computer device and to exchange signals with the network device or the other computer device.
The audio circuit 1005 may be used to provide an audio interface between the user and the computer device through a speaker and a microphone. On the one hand, the audio circuit 1005 may transmit the electrical signal converted from received audio data to the speaker, which converts it into a sound signal for output; on the other hand, the microphone converts a collected sound signal into an electrical signal, which is received by the audio circuit 1005 and converted into audio data, and the audio data is then output to the processor 1001 for processing and sent, for example, to another computer device via the radio frequency circuit 1004, or output to the memory 1002 for further processing. The audio circuit 1005 may also include an earphone jack to provide communication between a peripheral headset and the computer device.
The input unit 1006 may be used to receive input numbers, character information, or user characteristic information (e.g., fingerprint, iris, facial information, etc.), and generate keyboard, mouse, joystick, optical, or trackball signal inputs related to user settings and function control.
The power supply 1007 is used to supply power to the various components of the computer device 1000. Optionally, the power supply 1007 may be logically connected to the processor 1001 through a power management system, so that functions such as charging management, discharging management and power consumption management are implemented through the power management system. The power supply 1007 may also include one or more DC or AC power sources, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and other components.
Although not shown in fig. 10, the computer device 1000 may further include a camera, a sensor, a wireless fidelity module, a bluetooth module, etc., which are not described in detail herein.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
As can be seen from the above, the computer device provided in this embodiment obtains an audio set including sample noiseless audio and sample noise audio, generates sample mixed audio based on the sample noiseless audio and the sample noise audio, performs noise separation on the sample mixed audio through the audio enhancement model to be trained to obtain training noiseless audio and training noise audio, calculates the model loss corresponding to the audio enhancement model to be trained according to the sample noiseless audio, the sample noise audio, the sample mixed audio, the training noiseless audio and the training noise audio, adjusts the audio enhancement model to be trained based on the model loss to obtain the trained audio enhancement model, and performs audio enhancement on the audio to be enhanced through the trained audio enhancement model. In this way, the training precision of the audio enhancement model can be improved, the audio enhancement effect of the model can be improved, and higher audio clarity and perceptual quality can be obtained after the audio is enhanced.
It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be performed by instructions or by associated hardware controlled by the instructions, which may be stored in a computer readable storage medium and loaded and executed by a processor.
To this end, the present application provides a computer-readable storage medium, in which a plurality of computer programs are stored, and the computer programs can be loaded by a processor to execute the steps in any one of the audio enhancement methods provided by the embodiments of the present application. For example, the computer program may perform the steps of:
acquiring an audio set comprising a sample noiseless audio and a sample noise audio, and generating a sample mixed audio based on the sample noiseless audio and the sample noise audio;
carrying out noise separation on the sample mixed audio through an audio enhancement model to be trained to obtain a training noiseless audio and a training noise audio;
calculating model loss corresponding to the audio enhancement model to be trained according to the sample noiseless audio, the sample noise audio, the sample mixed audio, the training noiseless audio and the training noise audio;
and adjusting the audio enhancement model to be trained based on the model loss to obtain the trained audio enhancement model, so as to perform audio enhancement on the audio to be enhanced through the trained audio enhancement model.
In some optional embodiments, calculating a model loss corresponding to the audio enhancement model to be trained according to the sample noiseless audio, the sample noisy audio, the sample mixed audio, the training noiseless audio, and the training noisy audio includes:
calculating the noiseless audio training loss of the audio enhancement model to be trained according to the sample noiseless audio and the training noiseless audio;
calculating the noise audio training loss of the audio enhancement model to be trained according to the sample noise audio and the training noise audio;
mixing the training noiseless audio and the training noise audio to obtain training mixed audio, and calculating the mixed audio training loss of the audio enhancement model to be trained according to the training mixed audio and the sample mixed audio;
and calculating model loss corresponding to the audio enhancement model to be trained based on the noise-free audio training loss, the noise audio training loss and the mixed audio training loss.
In some optional embodiments, calculating a model loss corresponding to the audio enhancement model to be trained based on the noiseless audio training loss, the noisy audio training loss, and the mixed audio training loss includes:
acquiring a first loss weight corresponding to noise-free audio training loss, a second loss weight corresponding to noise audio training loss and a third loss weight corresponding to mixed audio training loss;
and according to the first loss weight, the second loss weight and the third loss weight, carrying out weighted calculation on the noise-free audio training loss, the noise audio training loss and the mixed audio training loss to obtain the model loss corresponding to the audio enhancement model to be trained.
In some optional embodiments, performing noise separation on the sample mixed audio through the audio enhancement model to be trained to obtain the training noiseless audio and the training noise audio includes:
mapping the frequency domain representation of the sample mixed audio to initial characteristics of the sample mixed audio through a coding layer of an audio enhancement model to be trained;
respectively extracting features of at least two sub-features of the initial features through a feature extraction layer of the audio enhancement model to be trained to obtain the spectral features of the sub-features, and extracting the relationship between the features of the at least two sub-features to obtain the spectral relation features between the sub-features;
decoding and mapping each spectral feature and each spectral relation feature through the decoding layer of the audio enhancement model to be trained to obtain the frequency domain representation of the training noiseless audio and the frequency domain representation of the training noise audio;
and obtaining the training noiseless audio and the training noise audio according to the frequency domain representation of the training noiseless audio and the frequency domain representation of the training noise audio.
In some optional embodiments, the audio enhancement method provided in the embodiments of the present invention further includes:
acquiring audio to be enhanced;
and performing audio enhancement on the audio to be enhanced by adopting the trained audio enhancement model to obtain enhanced audio corresponding to the audio to be enhanced.
In some optional embodiments, obtaining the audio to be enhanced includes:
acquiring time domain representation of audio to be enhanced, and converting the time domain representation into frequency domain representations with preset number, wherein the preset number is at least 2;
determining real part characteristics and imaginary part characteristics corresponding to the frequency domain representations from the frequency domain representations of the audio to be enhanced, and fusing the real part characteristics and the imaginary part characteristics to obtain the characteristics to be processed corresponding to the audio to be enhanced;
performing audio enhancement on the audio to be enhanced by using the trained audio enhancement model to obtain the enhanced audio corresponding to the audio to be enhanced includes:
performing noise separation processing on the features to be processed by adopting the trained audio enhancement model to obtain enhanced real part features and enhanced imaginary part features of the enhanced audio corresponding to the audio to be enhanced, wherein the number of the enhanced real part features and the enhanced imaginary part features is the same as the preset number;
and performing time domain conversion based on the enhanced real part characteristic and the enhanced imaginary part characteristic to obtain an enhanced audio corresponding to the audio to be enhanced.
In some optional embodiments, performing time domain conversion based on the enhanced real part feature and the enhanced imaginary part feature to obtain an enhanced audio corresponding to the audio to be enhanced, includes:
performing time domain conversion based on the enhanced real part characteristic and the enhanced imaginary part characteristic to obtain enhanced sub-audio with the same quantity as the preset quantity;
and synthesizing each enhanced sub-audio through a preset synthesis filter to obtain an enhanced audio corresponding to the audio to be enhanced.
According to the scheme, an audio set comprising sample noiseless audio and sample noise audio can be obtained, sample mixed audio is generated based on the sample noiseless audio and the sample noise audio, noise separation is performed on the sample mixed audio through the audio enhancement model to be trained to obtain training noiseless audio and training noise audio, the model loss corresponding to the audio enhancement model to be trained is calculated according to the sample noiseless audio, the sample noise audio, the sample mixed audio, the training noiseless audio and the training noise audio, the audio enhancement model to be trained is adjusted based on the model loss to obtain the trained audio enhancement model, and audio enhancement is performed on the audio to be enhanced through the trained audio enhancement model. In this way, the training precision of the audio enhancement model can be improved, the audio enhancement effect of the model can be improved, and higher audio clarity and perceptual quality can be obtained after the audio is enhanced.
The above operations can be implemented as described in the foregoing embodiments and are not described in detail herein.
Wherein the storage medium may include: a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or the like.
Since the computer program stored in the storage medium can execute the steps in any audio enhancement method provided in the embodiments of the present application, the beneficial effects that can be achieved by any audio enhancement method provided in the embodiments of the present application can be achieved, and detailed descriptions are omitted here for the foregoing embodiments.
The foregoing detailed description has provided a method, an apparatus, a storage medium, and a computer device for audio enhancement according to embodiments of the present application, and specific examples have been applied in the present application to explain the principles and implementations of the present application, and the descriptions of the foregoing embodiments are only used to help understand the method and the core ideas of the present application; meanwhile, for those skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims (11)

1. A method of audio enhancement, comprising:
acquiring an audio set comprising sample noiseless audio and sample noise audio, and generating sample mixed audio based on the sample noiseless audio and the sample noise audio;
carrying out noise separation on the sample mixed audio through an audio enhancement model to be trained to obtain training noiseless audio and training noise audio;
calculating model loss corresponding to the audio enhancement model to be trained according to the sample noiseless audio, the sample noise audio, the sample mixed audio, the training noiseless audio and the training noise audio;
and adjusting the audio enhancement model to be trained based on the model loss to obtain the trained audio enhancement model, so as to perform audio enhancement on the audio to be enhanced through the trained audio enhancement model.
2. The audio enhancement method of claim 1, wherein the calculating a model loss corresponding to the audio enhancement model to be trained according to the sample noiseless audio, the sample noise audio, the sample mixed audio, the training noiseless audio and the training noise audio comprises:
calculating the noiseless audio training loss of the audio enhancement model to be trained according to the sample noiseless audio and the training noiseless audio;
calculating the noise audio training loss of the audio enhancement model to be trained according to the sample noise audio and the training noise audio;
mixing the training noiseless audio and the training noise audio to obtain training mixed audio, and calculating the mixed audio training loss of the audio enhancement model to be trained according to the training mixed audio and the sample mixed audio;
and calculating the model loss corresponding to the audio enhancement model to be trained based on the noiseless audio training loss, the noise audio training loss and the mixed audio training loss.
3. The audio enhancement method of claim 2, wherein the calculating a model loss corresponding to the audio enhancement model to be trained based on the noiseless audio training loss, the noise audio training loss and the mixed audio training loss comprises:
acquiring a first loss weight corresponding to the noiseless audio training loss, a second loss weight corresponding to the noise audio training loss and a third loss weight corresponding to the mixed audio training loss;
and performing, according to the first loss weight, the second loss weight and the third loss weight, weighted calculation on the noiseless audio training loss, the noise audio training loss and the mixed audio training loss to obtain a model loss corresponding to the audio enhancement model to be trained.
4. The audio enhancement method of claim 1, wherein performing noise separation on the sample mixed audio through the audio enhancement model to be trained to obtain a training noiseless audio and a training noise audio comprises:
mapping the frequency domain representation of the sample mixed audio to initial features of the sample mixed audio through an encoding layer of an audio enhancement model to be trained;
respectively extracting features of at least two sub-features of the initial features through a feature extraction layer of the audio enhancement model to be trained to obtain the spectral features of the sub-features, and extracting the relationship between the features of the at least two sub-features to obtain the spectral relation features between the sub-features;
decoding and mapping each spectral feature and each spectral relation feature through the decoding layer of the audio enhancement model to be trained to obtain the frequency domain representation of the training noiseless audio and the frequency domain representation of the training noise audio;
and obtaining the training noiseless audio and the training noise audio according to the frequency domain representation of the training noiseless audio and the frequency domain representation of the training noise audio.
5. The audio enhancement method of any of claims 1-4, wherein the method further comprises:
acquiring the audio to be enhanced;
and performing audio enhancement on the audio to be enhanced by adopting the trained audio enhancement model to obtain an enhanced audio corresponding to the audio to be enhanced.
6. The audio enhancement method of claim 5, wherein the obtaining the audio to be enhanced comprises:
acquiring a time domain representation of the audio to be enhanced, and converting the time domain representation into frequency domain representations with preset number, wherein the preset number is at least 2;
determining real part characteristics and imaginary part characteristics corresponding to the frequency domain representations from the frequency domain representations of the audio to be enhanced, and fusing the real part characteristics and the imaginary part characteristics to obtain to-be-processed characteristics corresponding to the audio to be enhanced;
the audio enhancement is performed on the audio to be enhanced by adopting the trained audio enhancement model to obtain the enhanced audio corresponding to the audio to be enhanced, and the method comprises the following steps:
performing noise separation processing on the features to be processed by using the trained audio enhancement model to obtain enhanced real part features and enhanced imaginary part features of the enhanced audio corresponding to the audio to be enhanced, wherein the number of the enhanced real part features and the enhanced imaginary part features is the same as the preset number;
and performing time domain conversion based on the enhanced real part characteristic and the enhanced imaginary part characteristic to obtain an enhanced audio corresponding to the audio to be enhanced.
7. The audio enhancement method of claim 6, wherein the time-domain transforming based on the enhanced real-part feature and the enhanced imaginary-part feature to obtain the enhanced audio corresponding to the audio to be enhanced comprises:
performing time domain conversion based on the enhanced real part characteristic and the enhanced imaginary part characteristic to obtain enhanced sub-audio with the same quantity as the preset quantity;
and synthesizing each enhanced sub-audio through a preset synthesis filter to obtain an enhanced audio corresponding to the audio to be enhanced.
8. An audio enhancement device, comprising:
a sample obtaining unit, configured to obtain an audio set including a sample noiseless audio and a sample noisy audio, and generate a sample mixed audio based on the sample noiseless audio and the sample noisy audio;
the audio enhancement unit is used for carrying out noise separation on the sample mixed audio through an audio enhancement model to be trained to obtain training noiseless audio and training noise audio;
the loss calculation unit is used for calculating model loss corresponding to the audio enhancement model to be trained according to the sample noiseless audio, the sample noise audio, the sample mixed audio, the training noiseless audio and the training noise audio;
and the model adjusting unit is used for adjusting the audio enhancement model to be trained based on the model loss to obtain the trained audio enhancement model, so that the audio enhancement is performed on the audio to be enhanced through the trained audio enhancement model.
9. An electronic device comprising a memory and a processor; the memory stores an application program, and the processor is configured to execute the application program in the memory to perform the steps of the audio enhancement method of any one of claims 1 to 7.
10. A computer readable storage medium storing a plurality of instructions adapted to be loaded by a processor to perform the steps of the audio enhancement method according to any of claims 1 to 7.
11. A computer program product comprising computer programs or instructions, characterized in that the computer programs or instructions, when executed by a processor, implement the steps of the audio enhancement method of any of claims 1 to 7.
CN202211167895.9A 2022-09-23 2022-09-23 Audio enhancement method and device, electronic equipment and storage medium Pending CN115602183A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211167895.9A CN115602183A (en) 2022-09-23 2022-09-23 Audio enhancement method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211167895.9A CN115602183A (en) 2022-09-23 2022-09-23 Audio enhancement method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115602183A true CN115602183A (en) 2023-01-13

Family

ID=84844465

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211167895.9A Pending CN115602183A (en) 2022-09-23 2022-09-23 Audio enhancement method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115602183A (en)

Similar Documents

Publication Publication Date Title
CN110491407B (en) Voice noise reduction method and device, electronic equipment and storage medium
EP3607547B1 (en) Audio-visual speech separation
WO2021196905A1 (en) Voice signal dereverberation processing method and apparatus, computer device and storage medium
CN111508519B (en) Method and device for enhancing voice of audio signal
CN112259116B (en) Noise reduction method and device for audio data, electronic equipment and storage medium
CN108962231A (en) A kind of method of speech classification, device, server and storage medium
CN114338623B (en) Audio processing method, device, equipment and medium
CN111986691A (en) Audio processing method and device, computer equipment and storage medium
CN113299313B (en) Audio processing method and device and electronic equipment
CN113782044A (en) Voice enhancement method and device
Raj et al. Multilayered convolutional neural network-based auto-CODEC for audio signal denoising using mel-frequency cepstral coefficients
WO2023216760A1 (en) Speech processing method and apparatus, and storage medium, computer device and program product
CN115798459B (en) Audio processing method and device, storage medium and electronic equipment
CN112151055A (en) Audio processing method and device
CN117133307A (en) Low-power consumption mono voice noise reduction method, computer device and computer readable storage medium
CN115223584B (en) Audio data processing method, device, equipment and storage medium
US20230186943A1 (en) Voice activity detection method and apparatus, and storage medium
CN113763978B (en) Voice signal processing method, device, electronic equipment and storage medium
CN115602183A (en) Audio enhancement method and device, electronic equipment and storage medium
US11404055B2 (en) Simultaneous dereverberation and denoising via low latency deep learning
CN115954013A (en) Voice processing method, device, equipment and storage medium
CN114783455A (en) Method, apparatus, electronic device and computer readable medium for voice noise reduction
CN114758672A (en) Audio generation method and device and electronic equipment
CN114822569A (en) Audio signal processing method, device, equipment and computer readable storage medium
CN117153178B (en) Audio signal processing method, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination