CN111883164A - Model training method and device, electronic equipment and storage medium

Info

Publication number
CN111883164A
Authority
CN
China
Prior art keywords
characteristic information
audio data
model
threshold
trained
Prior art date
Legal status
Granted
Application number
CN202010575643.4A
Other languages
Chinese (zh)
Other versions
CN111883164B (en)
Inventor
张旭
郑羲光
张晨
Current Assignee
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202010575643.4A
Publication of CN111883164A
Application granted
Publication of CN111883164B
Legal status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208Noise filtering
    • G10L21/0216Noise filtering characterised by the method used for estimating noise
    • G10L21/0232Processing in the frequency domain
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks


Abstract

A model training method is provided. A plurality of sample data are obtained, and first characteristic information and amplitude characteristic information of the noisy audio data at each sampling point are determined from the original audio data and noisy audio data in each sample data. The first characteristic information is adjusted to obtain target characteristic information, and the model to be trained is trained on the target characteristic information and the corresponding amplitude characteristic information to obtain a trained model. When the trained model is used to denoise audio data, its denoising strength can be increased in the lower signal-to-noise ratio range and reduced in the higher signal-to-noise ratio range, so the trained model achieves different denoising effects for audio data in different signal-to-noise ratio ranges.

Description

Model training method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a model training method and apparatus, an electronic device, and a storage medium.
Background
With the development of computer technology, neural networks are increasingly applied to the processing of audio data, and often achieve better results and performance than traditional algorithms. To denoise audio data, a model to be trained is first trained to obtain a trained model, and the audio data is then denoised by the trained model to remove the noise data.
During the training of the model to be trained, the sample data is random: it may be audio data with a low signal-to-noise ratio or audio data with a high signal-to-noise ratio. A model trained on such random sample data cannot apply different denoising strengths to audio data in different signal-to-noise ratio ranges.
Disclosure of Invention
The present disclosure provides a model training method, an apparatus, an electronic device, and a storage medium, to at least solve the problem that, when denoising audio data, a model cannot apply different denoising strengths to audio data in different signal-to-noise ratio ranges.
The technical scheme of the disclosure is as follows:
according to a first aspect of embodiments of the present disclosure, there is provided a model training method, including:
obtaining a plurality of sample data, wherein each sample data comprises original audio data and noisy audio data;
according to the original audio data and the noisy audio data, determining first characteristic information and amplitude characteristic information of the noisy audio data in each sample data at each sampling point, wherein the first characteristic information is used for representing signal-to-noise ratio information of the noisy audio data at the corresponding sampling point;
adjusting the first characteristic information to obtain target characteristic information; when the first characteristic information is less than or equal to a first threshold value, the first characteristic information is reduced, and when the first characteristic information is greater than or equal to a second threshold value, the first characteristic information is increased, wherein the first threshold value is less than the second threshold value;
inputting the amplitude characteristic information into a model to be trained to obtain second characteristic information output by the model to be trained;
obtaining a loss value of the model to be trained according to the second characteristic information and the target characteristic information;
and adjusting the model parameters of the model to be trained according to the loss value until the loss value is less than or equal to a preset threshold value, and taking the model to be trained as a trained model.
Optionally, the adjusting the first feature information to obtain target feature information includes:
when the first characteristic information is less than or equal to the first threshold, reducing the first characteristic information below a third threshold;
and when the first characteristic information is greater than or equal to the second threshold, increasing the first characteristic information to above a fourth threshold.
Optionally, the method further includes: when the first characteristic information is larger than the first threshold and smaller than the second threshold, adjusting the first characteristic information to be between a fifth threshold and a sixth threshold, wherein the fifth threshold is smaller than the sixth threshold.
Optionally, the adjusting the first feature information to obtain target feature information includes: and adjusting the first characteristic information through a mapping function to obtain the target characteristic information.
Optionally, the first characteristic information is a ratio between the amplitude value of the original audio data and the amplitude value of the noisy audio data corresponding to the sampling point, and the first characteristic information is less than or equal to 1.
Optionally, the determining, according to the original audio data and the noisy audio data, first feature information and amplitude feature information of the noisy audio data at each sampling point in each sample data includes:
converting original audio data in target sample data into a first frequency domain signal, and converting noisy audio data in the target sample data into a second frequency domain signal; the target sample data is any one of the plurality of sample data;
and determining first characteristic information and amplitude characteristic information of the noisy audio data in the target sample data at each sampling point according to the first frequency domain signal and the second frequency domain signal.
According to a second aspect of the embodiments of the present disclosure, there is provided a model training apparatus including:
a first obtaining module configured to obtain a plurality of sample data, each of the sample data including original audio data and noisy audio data;
the determining module is configured to determine first characteristic information and amplitude characteristic information of the noisy audio data in each sample data at each sampling point according to the original audio data and the noisy audio data, wherein the first characteristic information is used for representing signal-to-noise ratio information of the noisy audio data at the corresponding sampling point;
a first adjusting module configured to adjust the first feature information to obtain target feature information, wherein the first feature information is decreased when the first feature information is less than or equal to a first threshold, and the first feature information is increased when the first feature information is greater than or equal to a second threshold, and the first threshold is less than the second threshold;
the input module is configured to input the amplitude characteristic information into a model to be trained to obtain second characteristic information output by the model to be trained;
a second obtaining module configured to obtain a loss value of the model to be trained according to the second feature information and the target feature information;
and the second adjusting module is configured to adjust the model parameters of the model to be trained according to the loss value until the loss value is less than or equal to a preset threshold value, and the model to be trained is used as a trained model.
Optionally, the first adjusting module is specifically configured to reduce the first characteristic information to below a third threshold when the first characteristic information is less than or equal to the first threshold, and to increase the first characteristic information to above a fourth threshold when the first characteristic information is greater than or equal to the second threshold.
Optionally, the first adjusting module is further specifically configured to adjust the first feature information to a range from a fifth threshold to a sixth threshold when the first feature information is greater than the first threshold and smaller than the second threshold, where the fifth threshold is smaller than the sixth threshold.
Optionally, the first adjusting module is specifically configured to adjust the first feature information through a mapping function to obtain the target feature information.
Optionally, the first characteristic information is a ratio between the amplitude value of the original audio data and the amplitude value of the noisy audio data corresponding to the sampling point, and the first characteristic information is less than or equal to 1.
Optionally, the determining module is specifically configured to convert original audio data in target sample data into a first frequency domain signal, and convert noisy audio data in the target sample data into a second frequency domain signal; the target sample data is any one of the plurality of sample data; and determining first characteristic information and amplitude characteristic information of the noisy audio data in the target sample data at each sampling point according to the first frequency domain signal and the second frequency domain signal.
According to a third aspect of the embodiments of the present disclosure, there is provided an electronic device, including:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the model training method as provided in the first aspect of the embodiments of the present disclosure.
According to a fourth aspect of embodiments of the present disclosure, there is provided a storage medium, wherein instructions that, when executed by a processor of an electronic device, enable the electronic device to perform the model training method as provided in the first aspect of embodiments of the present disclosure.
According to a fifth aspect of embodiments of the present disclosure, there is provided a computer program product containing instructions which, when run on a computer, cause the computer to perform the model training method as provided by the first aspect of embodiments of the present disclosure.
The technical scheme provided by the embodiment of the disclosure at least has the following beneficial effects:
in this embodiment, a plurality of sample data are acquired, and according to original audio data and noisy audio data in the sample data, first characteristic information and amplitude characteristic information of the noisy audio data in each sample data at each sampling point are determined. And adjusting the first characteristic information to obtain target characteristic information, inputting the amplitude characteristic information into the model to be trained to obtain second characteristic information output by the model to be trained, and obtaining the loss value of the model to be trained according to the second characteristic information and the target characteristic information. And adjusting the model parameters of the model to be trained according to the loss value until the loss value is less than or equal to a preset threshold value, and taking the model to be trained as a trained model. When the trained model is used for denoising the audio data, the denoising strength of the model can be enhanced in a lower signal-to-noise ratio range, and the denoising strength of the model can be reduced in a higher signal-to-noise ratio range, so that the trained model can obtain different denoising effects for the audio data in different signal-to-noise ratio ranges.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
FIG. 1 is a flow diagram illustrating a method of model training in accordance with an exemplary embodiment;
FIG. 2 is a flow diagram illustrating another method of model training in accordance with an exemplary embodiment;
FIG. 3 is a graph of a first mapping function;
FIG. 4 is a graph of a second mapping function;
FIG. 5 is a block diagram illustrating a model training apparatus in accordance with an exemplary embodiment;
FIG. 6 is a block diagram illustrating an electronic device in accordance with an exemplary embodiment;
FIG. 7 is a block diagram illustrating yet another electronic device in accordance with an exemplary embodiment.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
Fig. 1 is a flowchart illustrating a model training method according to an exemplary embodiment. Referring to fig. 1, the model training method provided in this embodiment may be applied to the denoising of audio data, so that the trained model applies different denoising strengths to audio data in different signal-to-noise ratio ranges. The method may be executed by a model training apparatus, which is generally implemented in software and/or hardware and may be disposed in an electronic device. The method may include:
step 101, obtaining a plurality of sample data.
In this embodiment, each sample data includes original audio data and noisy audio data, and the noisy audio data is synthesized from the original audio data and noise data. For the specific process of acquiring sample data, reference may be made to the prior art, which is not limited in this embodiment.
And step 102, determining first characteristic information and amplitude characteristic information of the noisy audio data in each sample data at each sampling point according to the original audio data and the noisy audio data.
The first characteristic information is used for representing signal-to-noise ratio information of the noisy audio data at a corresponding sampling point, and the amplitude characteristic information is an amplitude value of the noisy audio data corresponding to the sampling point.
In this embodiment, after the sample data is obtained, the first characteristic information and the corresponding amplitude characteristic information of the noisy audio data at each sampling point, that is, the signal-to-noise ratio information and the amplitude value of the noisy audio data at each sampling point, may be determined according to the original audio data and the noisy audio data included in the sample data. The first characteristic information may be the ratio between the amplitude value of the original audio data and the amplitude value of the noisy audio data corresponding to the sampling point, and the first characteristic information may be set to be less than or equal to 1.
Illustratively, step 102 may be implemented as follows:
converting original audio data in target sample data into a first frequency domain signal, and converting noisy audio data in the target sample data into a second frequency domain signal; the target sample data is any one of a plurality of sample data;
and determining first characteristic information and amplitude characteristic information of the noisy audio data in the target sample data at each sampling point according to the first frequency domain signal and the second frequency domain signal.
In this embodiment, the same Fourier transform may be performed on the original audio data and the noisy audio data contained in the sample data, converting both from time domain signals to frequency domain signals, so as to obtain information such as the amplitude values and frequency values of the original audio data and the noisy audio data respectively. For example, the original audio data may be converted into a first frequency domain signal, which may include sine wave components with frequencies of 3 kHz, 5 kHz, 7 kHz and 9 kHz, and the noisy audio data may be converted into a second frequency domain signal, which may also include sine wave components with frequencies of 3 kHz, 5 kHz, 7 kHz and 9 kHz. Then, for each common frequency point (sampling point), the ratio between the amplitude value of the first frequency domain signal and the amplitude value of the second frequency domain signal is calculated to obtain the corresponding first characteristic information, and the amplitude value of the second frequency domain signal is taken as the corresponding amplitude characteristic information. For example, at the 3 kHz sampling point, the ratio between the amplitude value of the first frequency domain signal and the amplitude value of the second frequency domain signal may give first characteristic information of 0.3, and the amplitude value of the second frequency domain signal at 3 kHz is the amplitude characteristic information corresponding to that first characteristic information. Similarly, first characteristic information of 0.4, 0.6 and 0.9 is calculated for the sampling points at 5 kHz, 7 kHz and 9 kHz respectively, and the amplitude characteristic information corresponding to each piece of first characteristic information is determined.
It should be noted that when the Fourier transform is performed on the original audio data and the noisy audio data, the parameters used for the transform may be selected as required; for the specific Fourier transform process, reference may be made to the prior art, which is not limited in this embodiment.
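To make the above concrete, the following is a minimal sketch of the feature computation as described, not the patent's own implementation: a plain FFT over the whole signal stands in for whatever transform and framing the method actually uses, and the function name and the small epsilon guard are illustrative assumptions.

```python
import numpy as np

def compute_features(original, noisy):
    """Per sampling point: first characteristic information (mask) and
    amplitude characteristic information (noisy magnitudes)."""
    mag_x = np.abs(np.fft.rfft(original))  # amplitude of first frequency domain signal
    mag_y = np.abs(np.fft.rfft(noisy))     # amplitude of second frequency domain signal
    mask = mag_x / (mag_y + 1e-8)          # ratio per sampling point (assumed epsilon guard)
    mask = np.minimum(mask, 1.0)           # clip at 1, as in the disclosure
    return mask, mag_y
```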
And 103, adjusting the first characteristic information to obtain target characteristic information.
When the first characteristic information is smaller than or equal to a first threshold value, the first characteristic information is reduced, and when the first characteristic information is larger than or equal to a second threshold value, the first characteristic information is increased, wherein the first threshold value is smaller than the second threshold value.
In this embodiment, after the first characteristic information and the amplitude characteristic information of each sampling point are determined, the first characteristic information may be adjusted to obtain the corresponding target characteristic information. The first threshold and the second threshold correspond to different signal-to-noise ratios. With reference to steps 101 to 102, the first characteristic information reflects the signal-to-noise ratio of the noisy audio data: the larger the first characteristic information, the higher the signal-to-noise ratio, and the smaller the first characteristic information, the lower the signal-to-noise ratio. When the first characteristic information is less than or equal to the first threshold, the signal-to-noise ratio of the noisy audio data is low; when the first characteristic information is greater than or equal to the second threshold, the signal-to-noise ratio of the noisy audio data is high.
For example, the first threshold may be set to 0.3 (the signal-to-noise ratio of the noisy audio data is low), and when the first feature information is 0.3, the first feature information may be reduced so that the reduced first feature information (i.e., the target feature information) is 0.2. Similarly, the second threshold may be set to 0.9 (the signal-to-noise ratio of the noisy audio data is higher), and when the first feature information is 0.9, the first feature information is increased so that the increased first feature information (i.e., the target feature information) is 0.95. The above is merely an exemplary example, and the specific values of the first threshold and the second threshold, and the decreasing and increasing methods of the first feature information may be set according to requirements, which is not limited in this embodiment.
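As a sketch of this adjustment, using the example thresholds 0.3 and 0.9 and the example outputs 0.2 and 0.95 from the text (the push amounts are illustrative, since the embodiment leaves them open):

```python
def adjust_first_feature(mask, first_threshold=0.3, second_threshold=0.9):
    """Reduce masks at or below the first threshold (0.3 -> 0.2) and
    increase masks at or above the second threshold (0.9 -> 0.95)."""
    if mask <= first_threshold:
        return max(mask - 0.1, 0.0)    # strengthen denoising at low SNR
    if mask >= second_threshold:
        return min(mask + 0.05, 1.0)   # weaken denoising at high SNR
    return mask
```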
And 104, inputting the amplitude characteristic information into the model to be trained to obtain second characteristic information output by the model to be trained.
And 105, obtaining a loss value of the model to be trained according to the second characteristic information and the target characteristic information.
And 106, adjusting model parameters of the model to be trained according to the loss value until the loss value is less than or equal to a preset threshold value, and taking the model to be trained as the trained model.
For steps 104 to 106, when the model to be trained is trained according to the target feature information and the corresponding amplitude feature information, the amplitude feature information of each sampling point may be used as an input of the model to be trained, and the target feature information corresponding to each sampling point (i.e., the target feature information obtained by adjusting the first feature information of each sampling point) is used as a target of the model to be trained, so as to train the model to be trained.
For example, the amplitude characteristic information corresponding to the 3 kHz sampling point may first be input into the model to be trained, and the second characteristic information corresponding to the 3 kHz sampling point output by the model to be trained is obtained. A loss value of the model to be trained is then calculated according to the target characteristic information and the second characteristic information corresponding to the 3 kHz sampling point, and the model parameters of the model to be trained are adjusted according to the loss value, completing one training iteration. By analogy, the model to be trained is trained according to the amplitude characteristic information and the target characteristic information corresponding to the 5 kHz, 7 kHz and 9 kHz sampling points in turn. Likewise, the model to be trained is trained repeatedly according to the target characteristic information and the amplitude characteristic information corresponding to each sampling point in each sample data until the loss value is less than or equal to the preset threshold, at which point training is complete and the model to be trained is taken as the trained model. The specific value of the preset threshold may be set as required, and for the calculation of the loss value and the adjustment of the model parameters, reference may be made to the prior art, which is not limited in this embodiment.
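A minimal PyTorch-style sketch of steps 104 to 106 follows. The architecture, the Adam optimizer, the mean-squared-error loss and the value of the preset threshold are all assumptions; the embodiment only requires some loss between the second characteristic information and the target characteristic information.

```python
import torch
import torch.nn as nn

# Assumed architecture: amplitude feature in, second characteristic information out.
model = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()        # assumed loss; the embodiment leaves it open
preset_threshold = 1e-4       # assumed value for the preset threshold

def train(magnitudes, targets, max_steps=10000):
    """Train until the loss value is at or below the preset threshold."""
    for _ in range(max_steps):
        pred = model(magnitudes.unsqueeze(-1)).squeeze(-1)  # second characteristic info
        loss = loss_fn(pred, targets)                       # vs. target characteristic info
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if loss.item() <= preset_threshold:
            break
    return model
```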
In this embodiment, after the trained model is obtained, audio data may be denoised by the trained model to obtain audio data with the noise data removed. With reference to the foregoing example, in the denoising process, a Fourier transform may first be performed on the audio data to obtain its frequency domain signal, and the amplitude value (amplitude characteristic information) and phase value corresponding to each frequency value (sampling point) in the frequency domain signal are determined. The amplitude value corresponding to each frequency value is taken as the input of the trained model, and the second characteristic information corresponding to each frequency value is obtained through the trained model.
With reference to step 102, the first characteristic information is the ratio between the amplitude value of the first frequency domain signal and the amplitude value of the second frequency domain signal in the sample data, so the amplitude value of the first frequency domain signal is the product of the first characteristic information and the amplitude value of the second frequency domain signal. By analogy, the amplitude value of the frequency domain signal of the denoised audio data is the product of the second characteristic information and the amplitude value of the frequency domain signal of the input audio data. Therefore, after the second characteristic information corresponding to each frequency value is determined, the product of the second characteristic information and the corresponding amplitude value gives the denoised amplitude value for each frequency value.
Meanwhile, in the Fourier transform process, the original audio data and the noisy audio data in the sample data undergo the same Fourier transform, that is, for the same frequency value (sampling point), the phase value in the first frequency domain signal is the same as the phase value in the second frequency domain signal. Similarly, the phase value corresponding to each frequency value in the frequency domain signal of the denoised audio data is the same as the phase value in the input audio data. Therefore, the phase value of the audio data is taken as the phase value of the denoised audio data, and an inverse Fourier transform is performed according to each frequency value in the frequency domain signal of the denoised audio data and the amplitude value and phase value corresponding to each frequency value, so as to obtain the audio data with the noise data removed.
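The denoising pass described above can be sketched as follows; `model_fn` is a stand-in for the trained model (magnitudes in, masks out), and the whole-signal FFT again replaces whatever transform the deployment actually uses.

```python
import numpy as np

def denoise(noisy, model_fn):
    """Predict per-bin masks from magnitudes, scale the magnitudes, reuse
    the noisy phase, and invert the transform."""
    spectrum = np.fft.rfft(noisy)
    magnitude = np.abs(spectrum)
    phase = np.angle(spectrum)                # kept unchanged, as described above
    mask = model_fn(magnitude)                # second characteristic information
    clean_spectrum = mask * magnitude * np.exp(1j * phase)
    return np.fft.irfft(clean_spectrum, n=len(noisy))
```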
In practical application, reducing the first characteristic information that is less than or equal to the first threshold and increasing the first characteristic information that is greater than or equal to the second threshold adjusts the signal-to-noise ratio information of the noisy audio data used as the training target. Because the trained model is obtained by training the model to be trained on the adjusted first characteristic information (namely the target characteristic information), it applies different denoising strengths when denoising audio data in different signal-to-noise ratio ranges. With reference to the above example, if the first characteristic information is 0.3 (less than or equal to the first threshold 0.3), a model trained on the unadjusted value 0.3 would, when denoising audio data whose first characteristic information is 0.3, output second characteristic information close to or equal to 0.3. A model trained on the target characteristic information 0.2 instead outputs second characteristic information close to or equal to 0.2; the amplitude value calculated from 0.2 is lower than that calculated from 0.3, so the amplitude of the denoised audio data is reduced and the denoising strength of the trained model is increased. Similarly, if the first characteristic information is 0.9 (greater than or equal to the second threshold 0.9), a model trained on the unadjusted value 0.9 would, when denoising audio data whose first characteristic information is 0.9, output second characteristic information close to or equal to 0.9. A model trained on the target characteristic information 0.95 instead outputs second characteristic information close to or equal to 0.95; the amplitude value calculated from 0.95 is higher than that calculated from 0.9, so the amplitude of the denoised audio data is increased and the denoising strength of the trained model is reduced. That is, the denoising strength of the model is enhanced in the lower signal-to-noise ratio range and reduced in the higher signal-to-noise ratio range.
In summary, in this embodiment, a plurality of sample data are obtained, and according to the original audio data and the noisy audio data in the sample data, the first characteristic information and the amplitude characteristic information of the noisy audio data in each sample data at each sampling point are determined. And adjusting the first characteristic information to obtain target characteristic information, inputting the amplitude characteristic information into the model to be trained to obtain second characteristic information output by the model to be trained, and obtaining the loss value of the model to be trained according to the second characteristic information and the target characteristic information. And adjusting the model parameters of the model to be trained according to the loss value until the loss value is less than or equal to a preset threshold value, and taking the model to be trained as a trained model. When the trained model is used for denoising the audio data, the denoising strength of the model can be enhanced in a lower signal-to-noise ratio range, and the denoising strength of the model can be reduced in a higher signal-to-noise ratio range, so that the trained model can obtain different denoising effects for the audio data in different signal-to-noise ratio ranges.
FIG. 2 is a flow diagram illustrating another model training method in accordance with an exemplary embodiment, which, with reference to FIG. 2, may include:
step 201, obtaining a plurality of original audio data.
In this embodiment, in the process of acquiring a plurality of sample data, a plurality of original audio data may be acquired first. Specifically, the electronic device may directly receive a plurality of original audio data input by the user, and the original audio data may be, for example, music or voice of a fixed length. The method for acquiring the original audio data and the specific type of the original audio data may be set according to requirements, which is not limited in this embodiment.
Step 202, adding noise data to the target audio data according to a preset rule to obtain noisy audio data corresponding to the target audio data, and taking the target audio data and the corresponding noisy audio data as sample data.
Wherein the target audio data is any one of the plurality of original audio data.
In this embodiment, after the original audio data is obtained, it may be processed to obtain the corresponding noisy audio data. Specifically, noise data may be added to the original audio data according to a preset rule to obtain the noisy audio data corresponding to the original audio data, and the original audio data and the noisy audio data together are used as sample data. The noise data may be, for example, speech of a fixed length. For instance, after music (original audio data) is acquired, speech (noise data) may be mixed into the music to obtain noisy audio data, and during mixing, the preset rule may require that the signal-to-noise ratio of the noisy audio data reach a preset value (e.g., 20 dB). The process of adding the noise data to the original audio data may be set as required, which is not limited in this embodiment. Adding noise data to the original audio data according to a preset rule yields noisy audio data, and hence sample data, that meets the training requirements, which improves training efficiency.
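A small sketch of such a preset rule, scaling the noise so the mixture reaches a target signal-to-noise ratio (the 20 dB figure comes from the example above; the scaling formula is a standard choice, not quoted from the patent):

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db=20.0):
    """Scale `noise` so that 10*log10(P_clean / P_noise) == snr_db,
    then add it to the clean (original) audio data."""
    noise = noise[: len(clean)]                # match lengths
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12  # avoid division by zero
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise
```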
Step 203, determining first characteristic information and amplitude characteristic information of the noisy audio data at each sampling point according to the original audio data and the noisy audio data.
And 204, adjusting the first characteristic information through a mapping function to obtain target characteristic information.
In this embodiment, the first feature information may be adjusted through a mapping function, so as to obtain target feature information corresponding to the first feature information.
Optionally, when the first characteristic information is less than or equal to the first threshold, the first characteristic information may be decreased to below a third threshold;
when the first feature information is equal to or greater than the second threshold, the first feature information may be increased to be equal to or greater than a fourth threshold.
In this embodiment, first characteristic information that is less than or equal to the first threshold may be directly reduced to below the third threshold, so that the trained model denoises audio data with a lower signal-to-noise ratio more strongly. First characteristic information that is greater than or equal to the second threshold may be increased to above the fourth threshold, so that the trained model denoises audio data with a higher signal-to-noise ratio more weakly.
For example, the first characteristic information may be adjusted by a first mapping function (the formula is reproduced only as an image in the source document). As shown in fig. 3, a graph of the first mapping function, the function reduces first characteristic information smaller than 0.6 (the first threshold) to 0.1 (the third threshold) or less, and increases first characteristic information larger than 0.8 (the second threshold) to 0.9 (the fourth threshold) or more.
As another example, the first characteristic information may be adjusted by a second mapping function (likewise reproduced only as an image in the source document). As shown in fig. 4, a graph of the second mapping function, the function adjusts first characteristic information equal to or less than 0.5 (the first threshold) to 0, and first characteristic information equal to or greater than 0.9 (the second threshold) to 1.
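Since both formulas are only reproduced as images in the source, here is one plausible realization of the second mapping function: the endpoints (at or below 0.5 maps to 0, at or above 0.9 maps to 1) come from the text, while the linear ramp between the thresholds is an assumption about the curve in fig. 4.

```python
import numpy as np

def second_mapping(mask):
    """Map first characteristic information <= 0.5 to 0 and >= 0.9 to 1;
    the linear ramp between the thresholds is an assumed interpolation."""
    return np.clip((mask - 0.5) / (0.9 - 0.5), 0.0, 1.0)
```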
In this embodiment, when the first characteristic information is less than or equal to the first threshold, it is directly reduced to the third threshold or below, so that smaller first characteristic information (target characteristic information) is obtained at low signal-to-noise ratios. For example, first characteristic information less than or equal to 0.5 is adjusted to 0, and training the model to be trained with this smaller target characteristic information means that, when the trained model denoises audio data with a lower signal-to-noise ratio, the level retained from the noisy audio data is reduced, the denoising strength is increased, and the denoising effect is improved. For example, if the first characteristic information is 0.3, a model trained on the unadjusted value 0.3 would, when denoising audio data whose first characteristic information is 0.3 (low signal-to-noise ratio, large noise component), output second characteristic information close to 0.3. A model trained on the adjusted first characteristic information (target characteristic information) of 0 instead outputs second characteristic information close to 0, and the amplitude value calculated from second characteristic information 0 is lower than that calculated from 0.3, so the computed amplitude of the denoised audio data is lower. Therefore, when denoising audio data with a lower signal-to-noise ratio (i.e., larger noise data), a greater denoising strength is obtained and the level of the denoised output is reduced.
Similarly, when the first characteristic information is greater than or equal to the second threshold, it is directly increased to the fourth threshold or above, so that larger target characteristic information is obtained at high signal-to-noise ratios. For example, first characteristic information greater than or equal to 0.9 is adjusted to 1, and training the model to be trained with this larger target characteristic information means that, when the trained model denoises audio data with a higher signal-to-noise ratio, the audio content in the noisy audio data is enhanced, the denoising strength is reduced, and the denoising effect is improved. For example, if the first characteristic information is 0.9, a model trained on the unadjusted value 0.9 would, when denoising audio data whose first characteristic information is 0.9 (high signal-to-noise ratio, small noise component), output second characteristic information close to 0.9. A model trained on the adjusted first characteristic information (target characteristic information) of 1 instead outputs second characteristic information close to 1, and the amplitude value calculated from second characteristic information 1 is higher than that calculated from 0.9, so the computed amplitude of the denoised audio data is higher. Therefore, when denoising audio data with a higher signal-to-noise ratio (i.e., smaller noise data), the denoising strength is reduced and the level of the denoised audio data is preserved.
Optionally, when the first characteristic information is greater than the first threshold and smaller than the second threshold, the first characteristic information is adjusted to a range from a fifth threshold to a sixth threshold, and the fifth threshold is smaller than the sixth threshold.
In this embodiment, when the first characteristic information is greater than the first threshold and smaller than the second threshold, the first characteristic information may be adjusted to be between a fifth threshold and a sixth threshold. By adjusting the first characteristic information between the first threshold and the second threshold, the overall denoising effect of the model can be adjusted, and the applicability of the model is improved. In conjunction with fig. 3, the first feature information greater than 0.6 (first threshold) and less than 0.8 (second threshold) may be adjusted to be between 0.1 (fifth threshold) and 0.9 (sixth threshold). And in conjunction with fig. 4, the first characteristic information greater than 0.5 (first threshold) and less than 0.9 (second threshold) may be adjusted to be between 0 (fifth threshold) and 1 (sixth threshold).
In practical application, adjusting the first characteristic information through a mapping function allows the first characteristic information over the whole range to be adjusted quickly, which improves the adjustment efficiency. It should be noted that the method for adjusting the first characteristic information includes, but is not limited to, adjustment by a mapping function.
And step 205, inputting the amplitude characteristic information into the model to be trained to obtain second characteristic information output by the model to be trained.
And step 206, obtaining a loss value of the model to be trained according to the second characteristic information and the target characteristic information.
And step 207, adjusting model parameters of the model to be trained according to the loss value until the loss value is less than or equal to a preset threshold value, and taking the model to be trained as the trained model.
In summary, in this embodiment, different adjustments may be made to the first feature information in different signal-to-noise ratio ranges to obtain target feature information located in different ranges, and the model to be trained is trained through the target feature information in different ranges, so that the trained model can obtain different denoising effects for audio data in different signal-to-noise ratio ranges.
Optionally, when the first feature information is greater than or equal to 1, the first feature information is adjusted to 1.
In this embodiment, after the first characteristic information is determined, it may be adjusted to 1 when it is greater than or equal to 1. For example, after determining the first characteristic information, the following function may be applied to it:

mask = min(MagX / MagY, 1)
in combination with step 102, mask is the first characteristic information, MagX is the amplitude value in the first frequency domain signal, and MagY is the amplitude value in the second frequency domain signal. When the first characteristic information is greater than or equal to 1, the first characteristic information is adjusted to 1, so that the situation that the amplitude in the second frequency domain signal is smaller than the amplitude in the first frequency domain signal due to phase cancellation and the like when noisy audio data is generated according to the original audio data and the noise data, and the first characteristic information is greater than 1 can be avoided. And then, the problems that when the first characteristic information is larger than 1, a larger target (target characteristic information) appears in the model training process, so that the model convergence is poorer and the training time is longer in the training process can be solved.
Referring to FIG. 5, FIG. 5 is a block diagram illustrating a model training apparatus according to an exemplary embodiment. The model training apparatus 500 may be applied to denoising of audio data, and may include: a first obtaining module 501, a determining module 502, a first adjusting module 503, an input module 504, a second obtaining module 505, and a second adjusting module 506.
The first obtaining module 501 is configured to obtain a plurality of sample data, each sample data including original audio data and noisy audio data.
The determining module 502 is configured to determine, according to the original audio data and the noisy audio data, first characteristic information and amplitude characteristic information of the noisy audio data at each sampling point in each sample data, where the first characteristic information is used to represent signal-to-noise ratio information of the noisy audio data at the corresponding sampling point.
The first adjusting module 503 is configured to adjust the first feature information to obtain the target feature information, wherein the first feature information is decreased when the first feature information is less than or equal to a first threshold, and the first feature information is increased when the first feature information is greater than or equal to a second threshold, and the first threshold is less than the second threshold.
The input module 504 is configured to input the amplitude feature information into the model to be trained, and obtain second feature information output by the model to be trained.
The second obtaining module 505 is configured to obtain a loss value of the model to be trained according to the second feature information and the target feature information;
the second adjusting module 506 is configured to adjust the model parameters of the model to be trained according to the loss value, and when the loss value is less than or equal to a preset threshold, the model to be trained is taken as the trained model.
Optionally, the first adjusting module 503 is specifically configured to reduce the first characteristic information to below a third threshold when the first characteristic information is less than or equal to the first threshold, and to increase the first characteristic information to above a fourth threshold when the first characteristic information is greater than or equal to the second threshold.
Optionally, the first adjusting module 503 is further specifically configured to adjust the first characteristic information to a range from a fifth threshold to a sixth threshold when the first characteristic information is greater than the first threshold and smaller than the second threshold, and the fifth threshold is smaller than the sixth threshold.
Optionally, the first adjusting module 503 is specifically configured to adjust the first feature information through a mapping function to obtain the target feature information.
Optionally, the first characteristic information is a ratio between the amplitude value of the original audio data and the amplitude value of the noisy audio data corresponding to the sampling point, and the first characteristic information is less than or equal to 1.
Optionally, the determining module 502 is specifically configured to convert the original audio data in the target sample data into a first frequency domain signal, and convert the noisy audio data in the target sample data into a second frequency domain signal; the target sample data is any one of a plurality of sample data; and determining first characteristic information and amplitude characteristic information of the noisy audio data in the target sample data at each sampling point according to the first frequency domain signal and the second frequency domain signal.
In summary, in this embodiment, a plurality of sample data are obtained, and according to the original audio data and the noisy audio data in the sample data, the first characteristic information and the amplitude characteristic information of the noisy audio data in each sample data at each sampling point are determined. And adjusting the first characteristic information to obtain target characteristic information, inputting the amplitude characteristic information into the model to be trained to obtain second characteristic information output by the model to be trained, and obtaining the loss value of the model to be trained according to the second characteristic information and the target characteristic information. And adjusting the model parameters of the model to be trained according to the loss value until the loss value is less than or equal to a preset threshold value, and taking the model to be trained as a trained model. When the trained model is used for denoising the audio data, the denoising strength of the model can be enhanced in a lower signal-to-noise ratio range, and the denoising strength of the model can be reduced in a higher signal-to-noise ratio range, so that the trained model can obtain different denoising effects for the audio data in different signal-to-noise ratio ranges.
Referring to fig. 6, fig. 6 is a block diagram illustrating an electronic device in accordance with an example embodiment. The electronic device 600 includes:
a processor 601.
A memory 602 for storing instructions executable by the processor 601.
Wherein the processor 601 is configured to execute executable instructions stored by the memory 602 to implement the model training method in the embodiment shown in fig. 1 or fig. 2.
In an exemplary embodiment, there is also provided a storage medium comprising instructions, such as the memory 602 comprising instructions, executable by the processor 601 of the electronic device 600 to perform the model training method in the embodiments shown in fig. 1 or fig. 2.
Alternatively, the storage medium may be a non-transitory computer readable storage medium, which may be, for example, a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
In an exemplary embodiment, a computer program product containing instructions is also provided, which when run on a computer, causes the computer to perform a model training method as in the embodiment shown in fig. 1 or fig. 2.
Referring to fig. 7, fig. 7 is a block diagram illustrating yet another electronic device according to an example embodiment, the electronic device 700 may include one or more of the following components: processing component 702, memory 704, power component 706, multimedia component 708, audio component 710, input/output (I/O) interface 713, sensor component 714, and communications component 716.
The processing component 702 generally controls overall operation of the device 700, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 702 may include one or more processors 720 to execute instructions to perform all or a portion of the steps of the model training method described above. Further, the processing component 702 may include one or more modules that facilitate interaction between the processing component 702 and other components. For example, the processing component 702 may include a multimedia module to facilitate interaction between the multimedia component 708 and the processing component 702.
The memory 704 is configured to store various types of data to support operations at the apparatus 700. Examples of such data include instructions for any application or method operating on device 700, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 704 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The power supply component 706 provides power to the various components of the device 700. The power components 706 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the apparatus 700.
The multimedia component 708 includes a screen that provides an output interface between the device 700 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 708 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the device 700 is in an operation mode, such as a photographing mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 710 is configured to output and/or input audio signals. For example, the audio component 710 includes a Microphone (MIC) configured to receive external audio signals when the device 700 is in an operation mode, such as a call mode, a recording mode, or a voice recognition mode. The received audio signals may further be stored in the memory 704 or transmitted via the communication component 716. In some embodiments, the audio component 710 also includes a speaker for outputting audio signals.
The I/O interface 713 provides an interface between the processing component 702 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor component 714 includes one or more sensors for providing status assessments of various aspects of the device 700. For example, the sensor component 714 may detect an open/closed state of the device 700 and the relative positioning of components, such as the display and keypad of the device 700. The sensor component 714 may also detect a change in position of the device 700 or of a component of the device 700, the presence or absence of user contact with the device 700, the orientation or acceleration/deceleration of the device 700, and a change in temperature of the device 700. The sensor component 714 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor component 714 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 714 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 716 is configured to facilitate wired or wireless communication between the device 700 and other devices. The device 700 may access a wireless network based on a communication standard, such as WiFi, an operator network (such as 2G, 3G, 4G, or 5G), or a combination thereof. In an exemplary embodiment, the communication component 716 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 716 further includes a Near Field Communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the device 700 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors, or other electronic components, for performing the above-described model training method.
In an exemplary embodiment, there is also provided a non-transitory computer-readable storage medium comprising instructions, such as the memory 704 comprising instructions, executable by the processor 720 of the device 700 to perform the model training method described above. For example, the non-transitory computer-readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
In the above embodiments, the implementation may be realized wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, the embodiments may be implemented in whole or in part in the form of a computer program product.
The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, radio, microwave) manner. The computer-readable storage medium may be any available medium accessible to a computer, or a data storage device, such as a server or a data center, integrating one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, or a magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a Solid State Disk (SSD)).
It is clear to those skilled in the art that, for convenience and brevity of description, reference may be made to the corresponding processes in the foregoing method embodiments for the specific working processes of the systems, apparatuses, and units described above, which are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus, and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative: the division of the units is only a logical functional division, and other divisions may be used in practice; for instance, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be in electrical, mechanical, or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the part of the technical solution of the present application that in essence contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
Since the apparatus embodiments are substantially similar to the method embodiments, they are described relatively briefly; for relevant details, refer to the corresponding description of the method embodiments.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure. This application is intended to cover any variations, uses, or adaptations of the disclosure that follow its general principles and include such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with the true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A method of model training, comprising:
obtaining a plurality of sample data, wherein each sample data comprises original audio data and noisy audio data;
according to the original audio data and the noisy audio data, determining first characteristic information and amplitude characteristic information of the noisy audio data in each sample data at each sampling point, wherein the first characteristic information is used for representing signal-to-noise ratio information of the noisy audio data at the corresponding sampling point;
adjusting the first characteristic information to obtain target characteristic information, wherein the first characteristic information is decreased when it is less than or equal to a first threshold and increased when it is greater than or equal to a second threshold, the first threshold being less than the second threshold;
inputting the amplitude characteristic information into a model to be trained to obtain second characteristic information output by the model to be trained;
obtaining a loss value of the model to be trained according to the second characteristic information and the target characteristic information;
and adjusting model parameters of the model to be trained according to the loss value until the loss value is less than or equal to a preset threshold, and taking the model to be trained at that point as the trained model.
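(For illustration only, and not as part of the claims: the following sketch shows one way the training procedure of claim 1 could be realized. It assumes PyTorch; the network, the optimizer, the mean-squared-error loss, and all numeric values are assumptions of this sketch rather than limitations of the claim, and adjust() refers to the threshold-adjustment sketch given after claim 3 below.)

# Hypothetical sketch of the claimed training loop (PyTorch assumed).
import torch
import torch.nn as nn

def train(model, samples, loss_threshold=1e-3, lr=1e-3, max_epochs=100):
    """samples: iterable of (amplitude_features, first_characteristic_info) pairs."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.MSELoss()  # one plausible loss; the claim does not mandate it
    for _ in range(max_epochs):
        for amplitude_features, first_info in samples:
            target = adjust(first_info)               # target characteristic information
            second_info = model(amplitude_features)   # second characteristic information
            loss = criterion(second_info, target)     # loss value of the model to be trained
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        if loss.item() <= loss_threshold:  # stop once the loss reaches the preset threshold
            break
    return model  # the trained model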
2. The method of claim 1, wherein the adjusting the first characteristic information to obtain the target characteristic information comprises:
when the first characteristic information is less than or equal to the first threshold, reducing the first characteristic information below a third threshold;
and when the first characteristic information is greater than or equal to the second threshold, increasing the first characteristic information above a fourth threshold.
3. The method of claim 2, further comprising:
when the first characteristic information is greater than the first threshold and less than the second threshold, adjusting the first characteristic information to be between a fifth threshold and a sixth threshold, wherein the fifth threshold is less than the sixth threshold.
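(Illustrative, non-limiting sketch of the piecewise adjustment of claims 2 and 3. All six threshold values below are assumptions chosen for this sketch; the claims only require that the first threshold be less than the second and the fifth less than the sixth.)

import torch

# Hypothetical threshold values, for illustration only.
FIRST, SECOND = 0.3, 0.7   # decision thresholds on the SNR-like ratio
THIRD, FOURTH = 0.1, 0.9   # push-down / push-up bounds (claim 2)
FIFTH, SIXTH = 0.2, 0.8    # range for the middle segment (claim 3)

def adjust(first_info: torch.Tensor) -> torch.Tensor:
    """Piecewise mapping from first characteristic information to target
    characteristic information."""
    target = torch.empty_like(first_info)
    low = first_info <= FIRST
    high = first_info >= SECOND
    mid = ~(low | high)
    # Claim 2: points dominated by noise are pushed down below the third threshold.
    target[low] = first_info[low] * (THIRD / FIRST)
    # Claim 2: points dominated by clean signal are pushed up above the fourth threshold.
    target[high] = FOURTH + (first_info[high] - SECOND) * (1.0 - FOURTH) / (1.0 - SECOND)
    # Claim 3: the remaining points are rescaled into [FIFTH, SIXTH].
    target[mid] = FIFTH + (first_info[mid] - FIRST) * (SIXTH - FIFTH) / (SECOND - FIRST)
    return target

Intuitively, the adjustment sharpens the training target: time-frequency points that are mostly noise are driven toward 0, points that are mostly clean signal are driven toward 1, and ambiguous points are kept in a middle band.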
4. The method of claim 1, wherein the adjusting the first characteristic information to obtain the target characteristic information comprises:
and adjusting the first characteristic information through a mapping function to obtain the target characteristic information.
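(As one non-limiting example of such a mapping function, a logistic curve produces the same push-apart effect as the piecewise rule in a single smooth operation; the choice of curve and its steepness k and midpoint c are assumptions of this sketch, since the claim does not specify any particular function.)

import torch

def adjust_with_mapping(first_info: torch.Tensor, k: float = 12.0, c: float = 0.5) -> torch.Tensor:
    """Smooth mapping: values below the midpoint c are pushed toward 0,
    values above it are pushed toward 1."""
    return torch.sigmoid(k * (first_info - c))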
5. The method according to claim 1, wherein the first characteristic information is a ratio of the amplitude value of the original audio data to the amplitude value of the noisy audio data at the corresponding sampling point, and the first characteristic information is less than or equal to 1.
6. The method of claim 5, wherein determining first characteristic information and amplitude characteristic information of the noisy audio data at respective sample points in each sample data from the original audio data and the noisy audio data comprises:
converting original audio data in target sample data into a first frequency domain signal, and converting noisy audio data in the target sample data into a second frequency domain signal; the target sample data is any one of the plurality of sample data;
and determining first characteristic information and amplitude characteristic information of the noisy audio data in the target sample data at each sampling point according to the first frequency domain signal and the second frequency domain signal.
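(A minimal, non-limiting sketch of the feature extraction of claims 5 and 6, assuming a short-time Fourier transform in PyTorch; the frame size, hop size, and window are illustrative assumptions.)

import torch

def extract_features(original: torch.Tensor, noisy: torch.Tensor,
                     n_fft: int = 512, hop: int = 256):
    """Returns amplitude characteristic information and first characteristic
    information per time-frequency sampling point."""
    window = torch.hann_window(n_fft)
    S = torch.stft(original, n_fft, hop_length=hop, window=window,
                   return_complex=True)  # first frequency domain signal
    Y = torch.stft(noisy, n_fft, hop_length=hop, window=window,
                   return_complex=True)  # second frequency domain signal
    amplitude_info = Y.abs()
    # Claim 5: ratio of the clean amplitude to the noisy amplitude at each
    # sampling point, clamped so it never exceeds 1.
    first_info = torch.clamp(S.abs() / (amplitude_info + 1e-8), max=1.0)
    return amplitude_info, first_info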
7. A model training apparatus, comprising:
a first obtaining module configured to obtain a plurality of sample data, each of the sample data including original audio data and noisy audio data;
a determining module configured to determine first characteristic information and amplitude characteristic information of the noisy audio data in each sample data at each sampling point according to the original audio data and the noisy audio data, wherein the first characteristic information is used for representing signal-to-noise ratio information of the noisy audio data at the corresponding sampling point;
a first adjusting module configured to adjust the first characteristic information to obtain target characteristic information, wherein the first characteristic information is decreased when it is less than or equal to a first threshold and increased when it is greater than or equal to a second threshold, the first threshold being less than the second threshold;
an input module configured to input the amplitude characteristic information into a model to be trained to obtain second characteristic information output by the model to be trained;
a second obtaining module configured to obtain a loss value of the model to be trained according to the second characteristic information and the target characteristic information;
and a second adjusting module configured to adjust model parameters of the model to be trained according to the loss value until the loss value is less than or equal to a preset threshold, and to take the model to be trained at that point as the trained model.
8. The apparatus according to claim 7, wherein the first adjusting module is specifically configured to decrease the first characteristic information below a third threshold when the first characteristic information is less than or equal to the first threshold, and to increase the first characteristic information above a fourth threshold when the first characteristic information is greater than or equal to the second threshold.
9. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the model training method of any one of claims 1-6.
10. A storage medium in which instructions, when executed by a processor of an electronic device, enable the electronic device to perform the model training method of any one of claims 1-6.
CN202010575643.4A 2020-06-22 2020-06-22 Model training method and device, electronic equipment and storage medium Active CN111883164B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010575643.4A CN111883164B (en) 2020-06-22 2020-06-22 Model training method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111883164A (en) 2020-11-03
CN111883164B CN111883164B (en) 2023-11-03

Family

ID=73158104


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101089952A (en) * 2006-06-15 2007-12-19 Kabushiki Kaisha Toshiba Method and device for noise control, speech spectrum smoothing, speech feature extraction, speech recognition, and speech model training
CN108269567A (en) * 2018-01-23 2018-07-10 Beijing Baidu Netcom Science and Technology Co., Ltd. Method and apparatus for generating far-field speech data, computing device, and computer-readable storage medium
US20190228791A1 (en) * 2018-01-23 2019-07-25 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and device for generating far-field speech data, computer device and computer readable storage medium
CN109378013A (en) * 2018-11-19 2019-02-22 NARI Group Co., Ltd. A voice de-noising method
CN109741760A (en) * 2018-12-18 2019-05-10 iFLYTEK Co., Ltd. Noise estimation method and system
CN109767782A (en) * 2018-12-28 2019-05-17 Institute of Acoustics, Chinese Academy of Sciences A speech enhancement method for improving the generalization performance of DNN models
CN110931019A (en) * 2019-12-06 2020-03-27 Guangzhou Guoyin Intelligent Technology Co., Ltd. Public security voice data acquisition method, device, equipment and computer storage medium

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112447183A (en) * 2020-11-16 2021-03-05 Beijing Dajia Internet Information Technology Co., Ltd. Training method and device for audio processing model, audio denoising method and device, and electronic equipment
CN112908288A (en) * 2021-01-25 2021-06-04 Beijing Dajia Internet Information Technology Co., Ltd. Beat detection method, beat detection device, electronic device, and storage medium
CN112908288B (en) * 2021-01-25 2023-11-21 Beijing Dajia Internet Information Technology Co., Ltd. Beat detection method, beat detection device, electronic equipment and storage medium
CN112951259A (en) * 2021-03-01 2021-06-11 Hangzhou NetEase Cloud Music Technology Co., Ltd. Audio noise reduction method and device, electronic equipment and computer readable storage medium
WO2023279366A1 (en) * 2021-07-09 2023-01-12 Guangdong Oppo Mobile Telecommunications Corp., Ltd. Noise reduction method based on transfer learning, terminal device, network device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant