CN114758669B - Audio processing model training method and device, audio processing method and device and electronic equipment - Google Patents


Info

Publication number
CN114758669B
CN114758669B (application number CN202210659913.9A)
Authority
CN
China
Prior art keywords
audio
audio signal
model
processing
signal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210659913.9A
Other languages
Chinese (zh)
Other versions
CN114758669A (en)
Inventor
钟雨崎
凌明
杨作兴
艾国
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen MicroBT Electronics Technology Co Ltd
Original Assignee
Shenzhen MicroBT Electronics Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen MicroBT Electronics Technology Co Ltd
Priority to CN202210659913.9A
Publication of CN114758669A
Application granted
Publication of CN114758669B
Legal status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L2021/02082 Noise filtering the noise being echo, reverberation of the speech

Abstract

The embodiment of the invention provides a method and a device for training an audio processing model, an audio processing method and device, and an electronic device. The method comprises the following steps: obtaining training samples, the training samples including a first audio signal and a second audio signal, wherein the second audio signal includes: a mixed acquisition signal of a first playing sound of a third audio signal and a second playing sound of the first audio signal; inputting the first audio signal and the second audio signal into an audio processing model to obtain a fourth audio signal; determining a model loss value of the audio processing model based on a difference between the fourth audio signal and a fifth audio signal, wherein the fifth audio signal is obtained by performing audio 3A processing on the second audio signal, and the audio 3A processing includes eliminating the second playing sound in the second audio signal; and adjusting model parameters of the audio processing model based on the model loss value so that the model loss value falls below a preset threshold. Embodiments of the invention can streamline the processing flow and reduce the computation amount and system complexity.

Description

Audio processing model training method and device, audio processing method and device and electronic equipment
Technical Field
The invention belongs to the technical field of audio processing, and particularly relates to a training method and apparatus for an audio processing model, an audio processing method and apparatus, and an electronic device.
Background
Audio 3A processing includes Acoustic Echo Cancellation (AEC), Automatic Noise Suppression (ANS), and Automatic Gain Control (AGC).
Fig. 1 is an exemplary schematic diagram of prior-art audio 3A processing. Each audio processing module (AEC, ANS, or AGC) performs a short-time Fourier transform (STFT) to convert the audio signal into a frequency-domain signal, and an inverse STFT to restore the frequency-domain signal to a time-domain audio signal. This repeated conversion between the frequency domain and the time domain makes the audio 3A flow cumbersome. Moreover, the STFT involves complex-exponential (powers of e) operations, which add a large computation amount and system complexity to audio 3A processing.
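To make this redundancy concrete, the following Python sketch shows how such a conventional 3A chain round-trips through the STFT and inverse STFT once per module. It is an illustration only; the aec_freq/ans_freq/agc_freq kernels are hypothetical placeholders, not a real library API.

```python
import numpy as np
from scipy.signal import stft, istft

# Placeholder per-module frequency-domain kernels; real implementations
# would estimate echo, noise, and gain here.
def aec_freq(spec, ref):
    return spec

def ans_freq(spec, ref):
    return spec

def agc_freq(spec, ref):
    return spec

def conventional_3a(mic: np.ndarray, ref: np.ndarray, fs: int = 16000) -> np.ndarray:
    out = mic
    for stage in (aec_freq, ans_freq, agc_freq):    # three separate modules
        _, _, spec = stft(out, fs=fs, nperseg=512)  # time -> frequency (complex e-power terms)
        spec = stage(spec, ref)                     # per-module frequency-domain processing
        _, out = istft(spec, fs=fs, nperseg=512)    # frequency -> time, repeated per module
    return out
```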
Disclosure of Invention
The embodiment of the invention provides a method and a device for training an audio processing model, an audio processing method and device and electronic equipment.
The technical scheme of the embodiment of the invention is as follows:
a method of training an audio processing model, comprising:
obtaining training samples, the training samples comprising a first audio signal and a second audio signal, wherein the second audio signal comprises: a mixed acquisition signal of a first playing sound of a third audio signal and a second playing sound of the first audio signal;
inputting the first audio signal and the second audio signal into an audio processing model to obtain a fourth audio signal;
determining a model loss value of the audio processing model based on a difference between the fourth audio signal and a fifth audio signal, wherein the fifth audio signal is obtained by performing audio 3A processing on the second audio signal, the audio 3A processing includes canceling the second playing sound in the second audio signal;
adjusting model parameters of the audio processing model based on the model loss value such that the model loss value is below a preset threshold.
In an exemplary embodiment, the audio processing model comprises an encoding sub-model, a coupling sub-model and a decoding sub-model; the inputting the first audio signal and the second audio signal into an audio processing model to obtain a fourth audio signal includes:
inputting the first audio signal and the second audio signal into the coding submodel to obtain a first audio characteristic coded according to the first audio signal and a second audio characteristic coded according to the second audio signal;
splicing the first audio feature and the second audio feature to obtain a spliced audio feature;
inputting the spliced audio features into the coupling submodel to obtain coupled audio features;
and inputting the coupled audio features into the decoding submodel to obtain the fourth audio signal decoded according to the coupled audio features.
In an exemplary embodiment, further comprising:
playing the third audio signal with a high fidelity audio device to produce the first played sound while playing the first audio signal with a speaker to produce the second played sound;
and collecting the first playing sound and the second playing sound by using a microphone to obtain the second audio signal.
In an exemplary embodiment, the third audio signal is a clean speech signal and the first audio signal is an interfering audio signal of the clean speech signal.
In an exemplary embodiment, said performing audio 3A processing on the second audio signal further comprises:
performing background noise suppression processing on the second audio signal from which the second playback sound is eliminated;
and performing automatic gain control on the second audio signal after the background noise suppression processing.
In an exemplary embodiment, the encoding submodel, the coupling submodel, and the decoding submodel each comprise a deep learning module, the deep learning module comprising at least one convolutional neural network and at least one deep neural network; or
The encoding submodel and the coupling submodel comprise encoders in a Transformer model, and the decoding submodel comprises decoders in the Transformer model.
An audio processing method, comprising:
acquiring a trained audio processing model, wherein the trained audio processing model is obtained by training according to the training method of the audio processing model;
inputting a sixth audio signal and a seventh audio signal into the audio processing model, wherein the seventh audio signal comprises: a mixed acquisition signal of the speaker's voice and a third playing sound of the sixth audio signal; the sixth audio signal is an interference audio signal of the voice of the speaker;
receiving, from the audio processing model, an eighth audio signal obtained by performing audio processing on the sixth audio signal and the seventh audio signal.
In an exemplary embodiment, when the sixth audio signal is played by a speaker of an edge device to generate the third played sound, the microphone of the edge device is used to mix and collect the speaker's voice and the third played sound to obtain the seventh audio signal.
An apparatus for training an audio processing model, comprising:
an acquisition module configured to acquire training samples, the training samples comprising a first audio signal and a second audio signal, wherein the second audio signal comprises: a mixed acquisition signal of a first playing sound of a third audio signal and a second playing sound of the first audio signal;
an input module configured to input the first audio signal and the second audio signal into an audio processing model, resulting in a fourth audio signal;
a determination module configured to determine a model loss value of the audio processing model based on a difference between the fourth audio signal and a fifth audio signal, wherein the fifth audio signal is obtained by performing audio 3A processing on the second audio signal, the audio 3A processing includes canceling the second playback sound in the second audio signal;
an adjustment module configured to adjust model parameters of the audio processing model based on the model loss value such that the model loss value is below a preset threshold.
In an exemplary embodiment, the audio processing model comprises an encoding submodel, a coupling submodel, and a decoding submodel;
the input module configured to: inputting the first audio signal and the second audio signal into the coding submodel to obtain a first audio characteristic coded according to the first audio signal and a second audio characteristic coded according to the second audio signal; splicing the first audio feature and the second audio feature to obtain a spliced audio feature; inputting the spliced audio features into the coupling submodel to obtain coupled audio features; and inputting the coupled audio features into the decoding submodel to obtain the fourth audio signal decoded according to the coupled audio features.
In an exemplary embodiment, the obtaining module is configured to: playing the third audio signal by using high-fidelity sound equipment to generate the first playing sound when the first audio signal is played by using a loudspeaker to generate the second playing sound; and collecting the first playing sound and the second playing sound by using a microphone to obtain the second audio signal.
In an exemplary embodiment, the third audio signal is a clean speech signal and the first audio signal is an interfering audio signal of the clean speech signal.
In an exemplary embodiment, further comprising:
an audio 3A processing module configured to perform the audio 3A processing, wherein the audio 3A processing further comprises: performing background noise suppression processing on the second audio signal from which the second playback sound is eliminated; and performing automatic gain control on the second audio signal after the background noise suppression processing.
In an exemplary embodiment, the encoding submodel, the coupling submodel, and the decoding submodel each comprise a deep learning module, the deep learning module comprising at least one convolutional neural network and at least one deep neural network; or
The encoding submodel and the coupling submodel comprise an encoder in a Transformer model, and the decoding submodel comprises a decoder in the Transformer model.
An audio processing apparatus comprising:
the acquisition module is configured to acquire a trained audio processing model, wherein the trained audio processing model is obtained by training according to the training method of the audio processing model;
an input module configured to input a sixth audio signal and a seventh audio signal into the audio processing model, wherein the seventh audio signal comprises: a mixed collected signal of the speaker's voice and a third playing sound of the sixth audio signal; the sixth audio signal is an interference audio signal of the voice of the speaker;
an output module configured to receive, from the audio processing model, an eighth audio signal obtained by performing audio processing on the sixth audio signal and the seventh audio signal.
In an exemplary embodiment, the input module is configured to mix and collect the speaker's voice and the third playback sound by using a microphone of the edge device to obtain the seventh audio signal when the sixth audio signal is played by using a speaker of the edge device to generate the third playback sound.
An electronic device, comprising:
a memory;
a processor;
wherein the memory has stored therein an application executable by the processor for causing the processor to perform the method of training an audio processing model as defined in any one of the above, or the method of audio processing as defined in any one of the above.
A computer readable storage medium having computer readable instructions stored thereon, which, when executed by a processor, cause the processor to perform a method of training an audio processing model as defined in any one of the above, or a method of audio processing as defined in any one of the above.
As can be seen from the foregoing technical solutions, in an embodiment of the present invention, a training sample is obtained, where the training sample includes a first audio signal and a second audio signal, and the second audio signal includes: a mixed acquisition signal of a first playing sound of a third audio signal and a second playing sound of the first audio signal; the first audio signal and the second audio signal are input into an audio processing model to obtain a fourth audio signal; a model loss value of the audio processing model is determined based on a difference between the fourth audio signal and a fifth audio signal, where the fifth audio signal is obtained by performing audio 3A processing on the second audio signal, and the audio 3A processing includes eliminating the second playing sound in the second audio signal; and model parameters of the audio processing model are adjusted based on the model loss value so that the model loss value falls below a preset threshold. Therefore, a network model with deep learning capability replaces conventional audio 3A processing; no conversion between the frequency domain and the time domain is needed, which accelerates the processing flow. In addition, the embodiment of the invention avoids complex operations such as powers of e, and can also reduce the computation amount and system complexity, making it especially suitable for application scenarios such as edge devices that can hardly provide sufficient computing power.
Drawings
Fig. 1 is an exemplary schematic diagram of a prior art audio 3A process.
FIG. 2 is an exemplary flowchart of a method for training an audio processing model according to an embodiment of the present invention.
FIG. 3 is an exemplary diagram illustrating a training process of an audio processing model according to an embodiment of the present invention.
FIG. 4 is an exemplary block diagram of a deep learning module according to an embodiment of the present invention.
Fig. 5 is an exemplary flowchart of an audio processing method according to an embodiment of the invention.
FIG. 6 is an exemplary diagram illustrating audio processing performed by the trained audio processing model according to an embodiment of the present invention.
Fig. 7 is an exemplary block diagram of an audio processing model training apparatus according to an embodiment of the present invention.
Fig. 8 is an exemplary block diagram of an audio processing device according to an embodiment of the present invention.
Fig. 9 is an exemplary block diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the accompanying drawings.
For simplicity and clarity of description, the invention is described below through several representative embodiments. Numerous details of the embodiments are set forth to provide an understanding of the principles of the invention; it will be apparent, however, that the invention may be practiced without these specific details. Some embodiments are not described in detail, but are merely provided as frameworks, in order to avoid unnecessarily obscuring aspects of the invention. Hereinafter, "comprising" means "including but not limited to", and "according to …" means "according to at least …, but not limited to only …". In view of the language conventions of Chinese, when the following description does not specifically state the number of a component, it means that the component may be one or more, or may be understood as at least one.
Hereinafter, terms related to the embodiments of the present disclosure are explained.
Audio 3A processing: a collective term for the three audio processing algorithms AEC, ANS, and AGC.
AEC: echo refers to the acoustic signal formed after the sound played by the local speaker of the device is picked up by the microphone. AEC is the process of canceling echo from the signal picked up by the microphone, while preserving the local user speech.
An ANS: refers to a process of recognizing and removing background noise in a sound.
AGC: the method is mainly used for adjusting the volume amplitude and improving the performance of sound in a noisy environment. For example, a person normally talks at a volume between 40-60dB, typically below 25dB sounds hard, while above 100dB sounds uncomfortable, the AGC can adjust the volume to a range acceptable to the person.
Convolutional Neural Network (CNN): a feedforward neural network, and currently one of the most representative network types in the technical field of deep learning.
Deep Neural Networks (Deep Neural Networks, DNN): can be understood as a neural network with many hidden layers.
Considering the defects of audio 3A processing in the prior art, embodiments of the invention use a network model with deep learning capability to replace conventional audio 3A processing; no conversion between the frequency domain and the time domain is needed, which accelerates the processing flow. In addition, the embodiments avoid complex operations such as powers of e, and can also reduce the computation amount and system complexity.
FIG. 2 is an exemplary flowchart of a method for training an audio processing model according to an embodiment of the present invention. After the training of the audio processing model is completed by the training method, the audio processing model can replace the conventional audio 3A processing.
As shown in fig. 2, the method includes:
step 201: obtaining training samples, the training samples including a first audio signal and a second audio signal, wherein the second audio signal includes: and a mixed acquisition signal of the first playing sound of the third audio signal and the second playing sound of the first audio signal.
The third audio signal is a target audio signal that needs to be kept as pure as possible during audio processing, for example, a pure speech signal of at least one speaker. For example, the third audio signal may be an audio file recorded with a recording device while the speaker reads aloud, without interference, for a preset duration. The first audio signal is an audio signal that interferes with the playing sound of the third audio signal (i.e., the first playing sound), such as audio the device's loudspeaker has played historically, e.g., a music file or a television file.
The first playing sound can be obtained by playing the third audio signal through a sound playing device (e.g., a high-fidelity audio device). While the third audio signal is played, the first audio signal is played through another sound playing device (e.g., a loudspeaker), yielding a second playing sound that interferes with the first playing sound. The mixed acquisition signal obtained by collecting the first playing sound and the second playing sound together with a sound collection device (e.g., a microphone) is the second audio signal.
Step 202: and inputting the first audio signal and the second audio signal into an audio processing model to obtain a fourth audio signal.
Here, the audio processing model is a model constructed by deep learning techniques. The inputs to the audio processing model are the first audio signal and the second audio signal. The audio processing model performs audio processing on the first audio signal and the second audio signal to obtain a fourth audio signal. Specifically, the audio processing may include: (1) performing dimensionality-reduction processing on the first audio signal and the second audio signal respectively to obtain a first audio feature corresponding to the first audio signal and a second audio feature corresponding to the second audio signal; (2) performing feature processing (e.g., splicing or fusing) on the first audio feature and the second audio feature to obtain a feature-processed audio feature; (3) performing dimensionality-raising processing on the feature-processed audio feature to obtain the fourth audio signal. By performing the subsequent steps 203 and 204, the fourth audio signal output by the audio processing model can be made similar or identical to the audio signal obtained by performing audio 3A processing on the second audio signal.
In an exemplary embodiment, the audio processing model includes an encoding submodel, a coupling submodel, and a decoding submodel; step 202 specifically includes: inputting the first audio signal and the second audio signal into the coding submodel to obtain a first audio characteristic coded according to the first audio signal and a second audio characteristic coded according to the second audio signal; splicing the first audio feature and the second audio feature to obtain a spliced audio feature; inputting the spliced audio features into a coupling sub-model to obtain coupled audio features; and inputting the coupled audio features into a decoding submodel to obtain a fourth audio signal decoded according to the coupled audio features. Therefore, the embodiment of the invention provides a specific structure of the audio processing model.
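A minimal PyTorch sketch of this encoder/splice/coupling/decoder structure follows. It is an interpretation under stated assumptions: the two signals share one encoder, plain linear layers stand in for the sub-models, and the frame and feature sizes are illustrative values, not values from the patent.

```python
import torch
import torch.nn as nn

class AudioProcessingModel(nn.Module):
    def __init__(self, frame_len: int = 512, feat_dim: int = 128):
        super().__init__()
        self.encoder = nn.Linear(frame_len, feat_dim)      # dimensionality reduction
        self.coupling = nn.Linear(2 * feat_dim, feat_dim)  # fuses the spliced features
        self.decoder = nn.Linear(feat_dim, frame_len)      # dimensionality raising

    def forward(self, first_sig: torch.Tensor, second_sig: torch.Tensor) -> torch.Tensor:
        f1 = self.encoder(first_sig)              # first audio feature
        f2 = self.encoder(second_sig)             # second audio feature
        spliced = torch.cat([f1, f2], dim=-1)     # splice (concat) along the feature axis
        coupled = self.coupling(spliced)          # coupled audio feature
        return self.decoder(coupled)              # fourth audio signal

# Usage: two framed signals in, one processed frame out.
model = AudioProcessingModel()
fourth = model(torch.randn(8, 512), torch.randn(8, 512))   # batch of 8 frames
```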
In an exemplary embodiment, the encoding submodel, the coupling submodel, and the decoding submodel each include a deep learning module, the deep learning module including at least one CNN and at least one DNN. Therefore, the embodiment of the invention can quickly construct the audio processing model based on the CNN and the DNN.
In an exemplary embodiment, the encoding submodel and the coupling submodel include an Encoder (Encoder) in a transform model, and the decoding submodel includes a Decoder (Decoder) in a transform model. Therefore, the embodiment of the invention can also adopt a Transformer model to quickly construct the audio processing model.
The above exemplarily describes typical structures of the audio processing model and of its coding, coupling, and decoding sub-models. It will be appreciated by those skilled in the art that such descriptions are merely exemplary and are not intended to limit the embodiments of the present invention.
Step 203: determining a model loss value of the audio processing model based on a difference between a fourth audio signal and a fifth audio signal, wherein the fifth audio signal is obtained by performing audio 3A processing on the second audio signal, and the audio 3A processing includes canceling the second playing sound in the second audio signal.
Here, by performing audio 3A processing on the second audio signal, a fifth audio signal is obtained. For example, the audio 3A processing performed on the second audio signal includes: (1) eliminating the second playing sound in the second audio signal through an AEC algorithm; (2) automatically denoising, through an ANS algorithm, the second audio signal from which the second playing sound has been eliminated; (3) performing automatic gain control, through an AGC algorithm, on the denoised second audio signal. For the specific algorithms adopted in the audio 3A processing of the second audio signal, reference may be made to the prior art, and details are not repeated in the embodiments of the present invention. In addition, the execution order of the AEC algorithm, the ANS algorithm, and the AGC algorithm during the audio 3A processing of the second audio signal may be changed, which is not limited in the embodiments of the present invention.
The difference between the fourth audio signal and the fifth audio signal is determined as the model loss value of the audio processing model. The model loss value evaluates the gap between the predicted value (i.e., the fourth audio signal) and the ground truth (i.e., the fifth audio signal), so that the model parameters of the audio processing model can be adjusted based on this gap.
Step 204: adjusting model parameters of the audio processing model based on the model loss value such that the model loss value is below a preset threshold.
In general, the smaller the model loss value, the better the performance of the audio processing model. Step 204 specifically includes: determining, using a back-propagation algorithm along the gradient-descent direction of the model loss value, the model parameters that bring the model loss value below the preset threshold, thereby completing the training process of the audio processing model.
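A hedged sketch of steps 203-204, assuming a mean-square-error loss (as suggested later in the description), an Adam optimizer, and an illustrative threshold; none of these specific choices is mandated by the patent.

```python
import torch

def train(model, loader, threshold: float = 1e-3, lr: float = 1e-3, max_epochs: int = 100):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.MSELoss()
    for _ in range(max_epochs):
        epoch_loss = 0.0
        for first_sig, second_sig, fifth_sig in loader:  # samples plus the 3A-processed target
            fourth_sig = model(first_sig, second_sig)    # step 202: model output
            loss = loss_fn(fourth_sig, fifth_sig)        # step 203: model loss value
            opt.zero_grad()
            loss.backward()                              # back-propagation of the loss gradient
            opt.step()                                   # step 204: adjust model parameters
            epoch_loss += loss.item()
        if epoch_loss / len(loader) < threshold:         # stop once below the preset threshold
            break
```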
The audio processing model, which completes the above training process, can be used to perform audio processing equivalent to the audio 3A processing effect on arbitrary audio.
FIG. 3 is an exemplary diagram illustrating a training process of an audio processing model according to an embodiment of the present invention. As shown in fig. 3, the audio processing model to be trained includes an encoding sub-model, a coupling sub-model, and a decoding sub-model. And splicing (concat) processing is also included between the coding submodel and the coupling submodel.
FIG. 4 is an exemplary block diagram of a deep learning module according to an embodiment of the present invention. It can be seen that the deep learning module includes 3 CNN models and 2 DNN models, where the 3 CNN models are neural networks with an encoding function and the 2 DNN models form a fully connected layer. In an exemplary embodiment, the encoding submodel, the coupling submodel, and the decoding submodel each include a deep learning module as shown in FIG. 4.
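A sketch of such a module under illustrative assumptions (kernel sizes, channel counts, and layer widths are not specified by the patent): three convolutional layers with an encoding function followed by two fully connected (DNN) layers.

```python
import torch
import torch.nn as nn

class DeepLearningModule(nn.Module):
    def __init__(self, in_ch: int = 1, frame_len: int = 128, out_dim: int = 128):
        super().__init__()
        self.cnn = nn.Sequential(                 # 3 CNN models with an encoding function
            nn.Conv1d(in_ch, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.dnn = nn.Sequential(                 # 2 DNN models forming fully connected layers
            nn.Linear(32 * frame_len, 256), nn.ReLU(),
            nn.Linear(256, out_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, in_ch, frame_len)
        return self.dnn(self.cnn(x).flatten(1))
```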
In another exemplary embodiment, the encoding submodel, the coupling submodel may comprise an encoder in a Transformer model, and the decoding submodel comprises a decoder in the Transformer model.
The complete training process of the audio processing model is described below.
The first step is as follows: training samples are obtained. The process of obtaining training samples includes:
(1) Prepare clean speech signals of the speaker (i.e., third audio signals) covering a number of durations.
(2) An audio signal (i.e., the first audio signal) that may interfere with the third audio signal is prepared. For example, considering that the edge device is a common audio 3A processing device, a television program, a song, etc., which the edge device has historically played, may be used as the first audio signal.
(3) Arrange the microphone close to the loudspeaker, for example directly above it (e.g., at a distance of 2 cm). Turn on the microphone to collect audio, play the first audio signal with the loudspeaker, and play the third audio signal with the high-fidelity audio device. Data collection then proceeds by using the microphone to collect the mixed acquisition signal of the first playing sound of the third audio signal and the second playing sound of the first audio signal, namely, the second audio signal.
Here, audio 3A processing may further be performed on the second audio signal to obtain the fifth audio signal. The audio 3A processing performed on the second audio signal includes: (1) eliminating the second playing sound in the second audio signal through an AEC algorithm; (2) automatically denoising, through an ANS algorithm, the second audio signal from which the second playing sound has been eliminated; (3) performing automatic gain control, through an AGC algorithm, on the denoised second audio signal to obtain the fifth audio signal. A sketch of this target-generation step follows.
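In the sketch below, the aec/ans/agc functions stand in for conventional algorithms (e.g., an adaptive-filter AEC such as the NLMS sketch above); they are placeholders for exposition, not a real library API.

```python
import numpy as np

def aec(x: np.ndarray, reference: np.ndarray) -> np.ndarray:
    return x   # placeholder for a conventional echo-cancellation algorithm

def ans(x: np.ndarray) -> np.ndarray:
    return x   # placeholder for a conventional noise-suppression algorithm

def agc(x: np.ndarray) -> np.ndarray:
    return x   # placeholder for conventional automatic gain control

def make_training_target(second_sig: np.ndarray, first_sig: np.ndarray) -> np.ndarray:
    echo_free = aec(second_sig, reference=first_sig)  # (1) cancel the second playing sound
    denoised = ans(echo_free)                         # (2) suppress background noise
    return agc(denoised)                              # (3) gain control -> fifth audio signal
```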
The second step: the first audio signal and the second audio signal are input into the audio processing model. The audio processing model performs audio processing on the first audio signal and the second audio signal, which specifically includes: (1) the encoding submodel encodes the first audio signal to obtain a first audio feature, and encodes the second audio signal to obtain a second audio feature; (2) the first audio feature and the second audio feature are spliced to obtain a spliced audio feature; (3) the coupling submodel performs feature fusion on the spliced audio feature to obtain a coupled audio feature; (4) the decoding submodel decodes the coupled audio feature to obtain the fourth audio signal.
The third step: a model loss value for the audio processing model is determined based on a difference (e.g., a mean square error) between the fourth audio signal and a fifth audio signal resulting from performing audio 3A processing on the second audio signal.
The fourth step: and adjusting model parameters of the audio processing model based on the model loss value so that the model loss value is lower than a preset threshold value. For example, based on the model loss value, the respective model parameters in the encoding submodel, the coupling submodel and the decoding submodel are adjusted respectively.
At this point, the training process of the audio processing model is completed. Then, audio processing equivalent to the audio 3A processing effect can be performed on arbitrary audio using the audio processing model that completes the training process.
Fig. 5 is an exemplary flowchart of an audio processing method according to an embodiment of the present invention.
As shown in fig. 5, the audio processing method includes:
step 501: and acquiring a trained audio processing model, wherein the trained audio processing model is obtained by training according to the training method of the audio processing model.
Step 502: inputting a sixth audio signal and a seventh audio signal into the audio processing model, wherein the seventh audio signal comprises: a mixed acquisition signal of the speaker's voice and a third playing sound of the sixth audio signal.
Here, the sixth audio signal is an audio signal that interferes with the speaker's voice. For example, when the speaker talks with another person through the edge device, the sixth audio signal is the program audio (e.g., a music file or a television file) being played on the edge device. The playing sound of the sixth audio signal interferes with processing of the speaker's voice.
Step 503: an eighth audio signal after performing audio processing on the sixth audio signal and the seventh audio signal is received from the audio processing model.
After receiving the sixth audio signal and the seventh audio signal, the audio processing model outputs an eighth audio signal. The eighth audio signal is the result of the audio processing model performing audio-3A-equivalent processing on the seventh audio signal using the sixth audio signal as a reference. Since audio 3A processing involves echo cancellation, the model input of the audio processing model needs to contain the seventh audio signal, which is the object of echo cancellation.
Considering that audio 3A processing is often required on edge devices, and that edge devices are more sensitive to the occupation of computing resources, the embodiment of the present invention is preferably applied on an edge device. Preferably, when the sixth audio signal is played by the speaker of the edge device to generate the third playing sound, the microphone of the edge device mixes and collects the speaker's voice and the third playing sound to obtain the seventh audio signal. For example, the edge device may be implemented as: a mobile terminal, a laptop, a smart speaker, a smart television, a personal digital assistant, or a smart headset, etc.
FIG. 6 is an exemplary diagram illustrating audio processing performed by the trained audio processing model according to an embodiment of the present invention. The audio processing model of fig. 6 has a similar structure to the model of fig. 3, except that the audio processing model of fig. 6 has been trained. The following description will explain an embodiment of the present invention applied to an edge device. The trained audio processing model may be disposed in an edge device, such as a storage medium accessible to a neural Network Processor (NPU) of the edge device.
At the edge device, the specific process of applying the trained audio processing model to perform audio processing comprises the following steps:
the first step is as follows: while playing a sixth audio signal (e.g., a music file, a television file, etc.) with the speaker of the edge device to produce a third played sound, the microphone of the edge device is simultaneously turned on. The microphone collects the voice of the speaker and the third played sound in a mixed manner to obtain a seventh audio signal.
The second step: the sixth audio signal and the seventh audio signal are input into the trained audio processing model. The trained audio processing model performs audio processing on the sixth audio signal and the seventh audio signal, which specifically includes: (1) the encoding submodel encodes the sixth audio signal to obtain a third audio feature, and encodes the seventh audio signal to obtain a fourth audio feature; (2) the third audio feature and the fourth audio feature are spliced to obtain a spliced audio feature; (3) the coupling submodel performs feature fusion on the spliced audio feature to obtain a coupled audio feature; (4) the decoding submodel decodes the coupled audio feature to obtain the eighth audio signal. The eighth audio signal is the signal obtained after the trained audio processing model performs audio-3A-equivalent processing on the seventh audio signal using the sixth audio signal. A sketch of this inference flow follows.
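A minimal sketch of the inference call, reusing the hypothetical AudioProcessingModel class sketched earlier; tensor shapes are illustrative assumptions.

```python
import torch

@torch.no_grad()
def run_inference(model: torch.nn.Module,
                  sixth_sig: torch.Tensor,
                  seventh_sig: torch.Tensor) -> torch.Tensor:
    model.eval()                          # trained model; no parameter updates on the edge device
    return model(sixth_sig, seventh_sig)  # eighth audio signal (audio-3A-equivalent output)
```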
FIG. 7 is an exemplary block diagram of an audio processing model training apparatus according to an embodiment of the present invention. As shown in fig. 7, the training apparatus 700 for an audio processing model includes:
an obtaining module 701 configured to obtain a training sample, the training sample comprising a first audio signal and a second audio signal, wherein the second audio signal comprises: a mixed acquisition signal of a first playing sound of the third audio signal and a second playing sound of the first audio signal;
an input module 702 configured to input the first audio signal and the second audio signal into an audio processing model, resulting in a fourth audio signal;
a determining module 703 configured to determine a model loss value of the audio processing model based on a difference between a fourth audio signal and a fifth audio signal, wherein the fifth audio signal is obtained by performing audio 3A processing on the second audio signal, and the audio 3A processing includes eliminating a second playing sound in the second audio signal;
an adjusting module 704 configured to adjust model parameters of the audio processing model based on the model loss value such that the model loss value is below a preset threshold.
In an exemplary embodiment, the audio processing model includes an encoding submodel, a coupling submodel, and a decoding submodel; an input module 702 configured to: inputting the first audio signal and the second audio signal into an encoding submodel to obtain a first audio characteristic encoded according to the first audio signal and a second audio characteristic encoded according to the second audio signal; splicing the first audio feature and the second audio feature to obtain a spliced audio feature; inputting the spliced audio features into a coupling submodel to obtain coupled audio features; and inputting the coupled audio features into a decoding submodel to obtain a fourth audio signal decoded according to the coupled audio features.
In an exemplary embodiment, the obtaining module 701 is configured to: playing the third audio signal by using high-fidelity sound equipment to generate the first playing sound when the first audio signal is played by using a loudspeaker to generate the second playing sound; and acquiring the first playing sound and the second playing sound by using a microphone to obtain the second audio signal.
In an exemplary embodiment, the third audio signal is a clean speech signal, and the first audio signal is an interfering audio signal of the clean speech signal.
Fig. 8 is an exemplary block diagram of an audio processing device according to an embodiment of the present invention. As shown in fig. 8, the audio processing apparatus 800 includes:
an obtaining module 801 configured to obtain a trained audio processing model, where the trained audio processing model is obtained by training according to the training method of any one of the above audio processing models;
an input module 802 configured to input a sixth audio signal and a seventh audio signal into the audio processing model, wherein the seventh audio signal comprises: a mixed acquisition signal of the third playing sound of the speaker's voice and the sixth audio signal; the sixth audio signal is an interference audio signal of the voice of the speaker;
an output module 803 configured to receive, from the audio processing model, an eighth audio signal after performing audio processing on the sixth audio signal and the seventh audio signal.
In an exemplary embodiment, the input module 802 is configured to mix the captured speaker's voice and the third playback sound using a microphone of the edge device to obtain a seventh audio signal when the sixth audio signal is played using a speaker of the edge device to generate a third playback sound.
The invention also provides a training apparatus of the audio processing model and an audio processing apparatus, each comprising: a processor and a memory, wherein the memory stores an application program executable by the processor for causing the processor to execute the training method of the audio processing model or the audio processing method as described above. The memory may be embodied as various storage media such as an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Flash memory, and a Programmable Read-Only Memory (PROM). The processor may be implemented to include one or more central processors or one or more field programmable gate arrays, where a field programmable gate array integrates one or more central processor cores. Specifically, the central processor or central processor core may be implemented as a CPU, an MCU, or a Digital Signal Processor (DSP).
Fig. 9 is an exemplary block diagram of an electronic device according to an embodiment of the present invention. Preferably, the electronic device 900 may be implemented as an edge device.
The electronic device 900 includes: a processor 901 and a memory 902. Processor 901 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and so forth. The processor 901 may be implemented in at least one hardware form of Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA), and Programmable Logic Array (PLA). The processor 901 may also include a main processor and a coprocessor, where the main processor is a processor for Processing data in an awake state, and is also called a Central Processing Unit (CPU); a coprocessor is a low power processor for processing data in a standby state. In some embodiments, the processor 901 may be integrated with a Graphics Processing Unit (GPU), which is responsible for rendering and drawing the content required to be displayed on the display screen. In some embodiments, the processor 901 may further include an AI processor for processing computing operations related to machine learning. For example, the AI processor may be implemented as a neural network processor.
Memory 902 may include one or more computer-readable storage media, which may be non-transitory. The memory 902 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices.
In some embodiments, a non-transitory computer readable storage medium in the memory 902 is used to store at least one instruction for execution by the processor 901 to implement the audio processing model training method or the audio processing method provided by various embodiments in the present disclosure. In some embodiments, the electronic device 900 may further include: a peripheral interface 903 and at least one peripheral. The processor 901, memory 902, and peripheral interface 903 may be connected by buses or signal lines. Various peripheral devices may be connected to the peripheral interface 903 via a bus, signal line, or circuit board. Specifically, the peripheral device includes: at least one of a radio frequency circuit 904, a touch display screen 905, a camera assembly 906, an audio circuit 907, a positioning assembly 908, and a power supply 909. The peripheral interface 903 may be used to connect at least one Input/Output (I/O) related peripheral to the processor 901 and the memory 902. In some embodiments, the processor 901, memory 902, and peripheral interface 903 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 901, the memory 902, and the peripheral interface 903 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.
The Radio Frequency circuit 904 is used to receive and transmit Radio Frequency (RF) signals, also referred to as electromagnetic signals. The radio frequency circuitry 904 communicates with communication networks and other communication devices via electromagnetic signals. The radio frequency circuit 904 converts an electrical signal into an electromagnetic signal to transmit, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 904 comprises: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuit 904 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: metropolitan area networks, various generations of mobile communication networks (2G, 3G, 4G, and 5G), Wireless local area networks, and/or Wireless Fidelity (Wi-Fi) networks. In some embodiments, the radio frequency circuit 904 may also include Near Field Communication (NFC) related circuits, which are not limited by this disclosure.
The display screen 905 is used to display a User Interface (UI). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 905 is a touch display screen, it also has the ability to capture touch signals on or over its surface. The touch signal may be input to the processor 901 as a control signal for processing. At this point, the display screen 905 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there may be one display screen 905, disposed on the front panel of the electronic device 900; in other embodiments, there may be at least two display screens 905, respectively disposed on different surfaces of the electronic device 900 or in a folded design; in some implementations, the display screen 905 may be a flexible display disposed on a curved surface or on a folded surface of the electronic device 900. The display screen 905 may even be arranged in a non-rectangular irregular figure, i.e., a shaped screen. The display screen 905 may be a Liquid Crystal Display (LCD) panel, an Organic Light-Emitting Diode (OLED) panel, or the like.
The camera assembly 906 is used to capture images or video. Optionally, camera assembly 906 includes a front camera and a rear camera. Generally, a front camera is disposed at a front panel of the terminal, and a rear camera is disposed at a rear surface of the terminal. In some embodiments, the number of the rear cameras is at least two, and each rear camera is any one of a main camera, a depth-of-field camera, a wide-angle camera and a telephoto camera, so that the main camera and the depth-of-field camera are fused to realize a background blurring function, and the main camera and the wide-angle camera are fused to realize a panoramic shooting function and a Virtual Reality (VR) shooting function or other fusion shooting functions. In some embodiments, camera assembly 906 may also include a flash. The flash lamp can be a monochrome temperature flash lamp or a bicolor temperature flash lamp. The double-color-temperature flash lamp refers to a combination of a warm-light flash lamp and a cold-light flash lamp, and can be used for light compensation under different color temperatures.
Audio circuitry 907 may include a microphone and a speaker. The microphone is used to collect sound waves of the user and the environment, convert them into electric signals, and input them to the processor 901 for processing, or input them to the radio frequency circuit 904 to realize voice communication. For stereo capture or noise reduction purposes, there may be multiple microphones located at different positions of the electronic device 900. The microphone may also be an array microphone or an omnidirectional pickup microphone. The speaker is used to convert electric signals from the processor 901 or the radio frequency circuit 904 into sound waves. The speaker may be a traditional diaphragm speaker or a piezoelectric ceramic speaker. When the speaker is a piezoelectric ceramic speaker, it can be used for purposes such as converting an electric signal into a sound wave audible to a human being, or converting an electric signal into a sound wave inaudible to a human being to measure a distance. In some implementations, the audio circuit 907 may also include a headphone jack.
The positioning component 908 is used to locate the current geographic location of the electronic device 900 to implement navigation or Location Based Services (LBS). The positioning component 908 may be based on the Global Positioning System (GPS) of the United States, the BeiDou system of China, the GLONASS system of Russia, or the Galileo system of the European Union. The power supply 909 is used to supply power to the various components in the electronic device 900. The power source 909 may be alternating current, direct current, disposable, or rechargeable. When the power source 909 includes a rechargeable battery, the rechargeable battery may support wired or wireless charging.
Those skilled in the art will appreciate that the above-described arrangements are not limiting of the electronic device 900 and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components may be used.
It should be noted that not all steps and modules in the above flows and structures are necessary, and some steps or modules may be omitted according to actual needs. The execution sequence of the steps is not fixed and can be adjusted according to the needs. The division of each module is only for convenience of describing adopted functional division, and in actual implementation, one module may be divided into multiple modules, and the functions of multiple modules may also be implemented by the same module, and these modules may be located in the same device or in different devices.
The hardware modules in the various embodiments may be implemented mechanically or electronically. For example, a hardware module may include a specially designed permanent circuit or logic device (e.g., a special purpose processor such as an FPGA or ASIC) for performing specific operations. A hardware module may also include programmable logic devices or circuits (e.g., including a general-purpose processor or other programmable processor) that are temporarily configured by software to perform certain operations. The implementation of the hardware module in a mechanical manner, or in a dedicated permanent circuit, or in a temporarily configured circuit (e.g., configured by software), may be determined based on cost and time considerations.
The present invention also provides a machine-readable storage medium storing instructions for causing a machine to perform a method according to the present application. Specifically, a system or an apparatus equipped with a storage medium on which a software program code that realizes the functions of any one of the above-described embodiments is stored may be provided, and a computer (or a CPU or MPU) of the system or the apparatus is caused to read out and execute the program code stored in the storage medium. Further, part or all of the actual operations may be performed by an operating system or the like operating on the computer by instructions based on the program code. The functions of any of the above-described embodiments may also be implemented by writing the program code read out from the storage medium to a memory provided in an expansion board inserted into the computer or to a memory provided in an expansion unit connected to the computer, and then causing a CPU or the like mounted on the expansion board or the expansion unit to perform part or all of the actual operations based on the instructions of the program code. Examples of the storage medium for supplying the program code include floppy disks, hard disks, magneto-optical disks, optical disks (e.g., CD-ROMs, CD-R, CD-RWs, DVD-ROMs, DVD-RAMs, DVD-RWs, DVD + RWs), magnetic tapes, nonvolatile memory cards, and ROMs. Alternatively, the program code may be downloaded from a server computer or the cloud by a communication network.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (16)

1. A method for training an audio processing model, comprising:
obtaining training samples, the training samples comprising a first audio signal and a second audio signal, wherein the second audio signal comprises: a mixed acquisition signal of a first playing sound of a third audio signal and a second playing sound of the first audio signal;
inputting the first audio signal and the second audio signal into an audio processing model to obtain a fourth audio signal;
determining a model loss value of the audio processing model based on a difference between the fourth audio signal and a fifth audio signal, wherein the fifth audio signal is obtained by performing audio 3A processing on the second audio signal, the audio 3A processing includes canceling the second playing sound in the second audio signal;
adjusting model parameters of the audio processing model based on the model loss value so that the model loss value is lower than a preset threshold value;
the audio processing model comprises an encoding sub-model, a coupling sub-model and a decoding sub-model; the inputting the first audio signal and the second audio signal into an audio processing model to obtain a fourth audio signal includes:
inputting the first audio signal and the second audio signal into the coding submodel to obtain a first audio characteristic coded according to the first audio signal and a second audio characteristic coded according to the second audio signal;
splicing the first audio feature and the second audio feature to obtain a spliced audio feature;
inputting the spliced audio features into the coupling submodel to obtain coupled audio features;
and inputting the coupled audio features into the decoding submodel to obtain the fourth audio signal decoded according to the coupled audio features.
2. The method of claim 1, further comprising:
playing the third audio signal with a high fidelity audio device to produce the first played sound while playing the first audio signal with a speaker to produce the second played sound;
and acquiring the first playing sound and the second playing sound by using a microphone to obtain the second audio signal.
3. The method of claim 2, wherein the third audio signal is a clean speech signal and the first audio signal is an interfering audio signal of the clean speech signal.
4. A method for training an audio processing model according to any of claims 1-3, wherein said performing audio 3A processing on the second audio signal further comprises:
performing background noise suppression processing on the second audio signal from which the second playback sound is eliminated;
performing automatic gain control on the second audio signal after the background noise suppression processing.
5. The method for training an audio processing model according to any one of claims 1-3, wherein:
the encoding submodel, the coupling submodel and the decoding submodel each comprise a deep learning module, and the deep learning module comprises at least one convolutional neural network and at least one deep neural network; or
the encoding submodel and the coupling submodel comprise encoders in a Transformer model, and the decoding submodel comprises decoders in the Transformer model.
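Both variants map naturally onto standard PyTorch building blocks; in the sketch below the layer counts, widths, and attention-head counts are illustrative assumptions.

```python
import torch.nn as nn

# Variant 1: a deep learning module with at least one convolutional
# neural network and at least one deep (fully connected) neural network.
class CnnDnnModule(nn.Module):
    def __init__(self, in_ch, out_dim):
        super().__init__()
        self.cnn = nn.Sequential(  # convolutional part
            nn.Conv1d(in_ch, 64, kernel_size=3, padding=1), nn.ReLU())
        self.dnn = nn.Sequential(  # fully connected part
            nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, out_dim))

    def forward(self, x):                    # x: (batch, channels, time)
        h = self.cnn(x).transpose(1, 2)      # (batch, time, 64)
        return self.dnn(h).transpose(1, 2)   # (batch, out_dim, time)

# Variant 2: encoding and coupling submodels built from Transformer
# encoder layers, decoding submodel from Transformer decoder layers.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True),
    num_layers=4)
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model=256, nhead=8, batch_first=True),
    num_layers=4)
```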
6. An audio processing method, comprising:
acquiring a trained audio processing model, wherein the trained audio processing model is trained according to the method for training an audio processing model of any one of claims 1-5;
inputting a sixth audio signal and a seventh audio signal into the audio processing model, wherein the seventh audio signal comprises: a mixed acquisition signal of a speaker's voice and a third playing sound of the sixth audio signal; and the sixth audio signal is an interfering audio signal of the speaker's voice;
receiving, from the audio processing model, an eighth audio signal obtained by performing audio processing on the sixth audio signal and the seventh audio signal.
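Inference along the lines of this claim might look as follows, reusing the hypothetical AudioProcessingModel sketched under claim 1; the checkpoint path and the random stand-in tensors are placeholders.

```python
import torch

# AudioProcessingModel is the illustrative class from the claim 1 sketch.
model = AudioProcessingModel()
model.load_state_dict(torch.load("audio_processing_model.pt"))  # trained weights
model.eval()

sixth_audio = torch.randn(1, 1, 16000)    # stand-in interference signal
seventh_audio = torch.randn(1, 1, 16000)  # stand-in microphone mix (voice + playback)

with torch.no_grad():
    eighth_audio = model(sixth_audio, seventh_audio)  # processed output
```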
7. The audio processing method according to claim 6,
wherein, when the sixth audio signal is played through a loudspeaker of an edge device to generate the third playing sound, the speaker's voice and the third playing sound are mixed and collected by a microphone of the edge device to obtain the seventh audio signal.
8. An apparatus for training an audio processing model, comprising:
an acquisition module configured to acquire training samples, the training samples comprising a first audio signal and a second audio signal, wherein the second audio signal comprises: a mixed acquisition signal of a first playing sound of a third audio signal and a second playing sound of the first audio signal;
an input module configured to input the first audio signal and the second audio signal into an audio processing model, resulting in a fourth audio signal;
a determination module configured to determine a model loss value of the audio processing model based on a difference between the fourth audio signal and a fifth audio signal, wherein the fifth audio signal is obtained by performing audio 3A processing on the second audio signal, and the audio 3A processing includes canceling the second playing sound in the second audio signal;
an adjustment module configured to adjust model parameters of the audio processing model based on the model loss value so that the model loss value is below a preset threshold, wherein the audio processing model comprises an encoding submodel, a coupling submodel and a decoding submodel;
the input module is configured to: input the first audio signal and the second audio signal into the encoding submodel to obtain a first audio feature encoded from the first audio signal and a second audio feature encoded from the second audio signal; splice the first audio feature and the second audio feature to obtain a spliced audio feature; input the spliced audio feature into the coupling submodel to obtain a coupled audio feature; and input the coupled audio feature into the decoding submodel to obtain the fourth audio signal decoded from the coupled audio feature.
9. The apparatus for training an audio processing model according to claim 8,
the acquisition module is configured to: play the third audio signal with a high-fidelity audio device to produce the first playing sound while playing the first audio signal with a speaker to produce the second playing sound; and acquire the first playing sound and the second playing sound with a microphone to obtain the second audio signal.
10. The apparatus for training an audio processing model according to claim 9, wherein the third audio signal is a clean speech signal, and the first audio signal is an interfering audio signal of the clean speech signal.
11. The apparatus for training an audio processing model according to any one of claims 8-10, further comprising:
an audio 3A processing module configured to perform the audio 3A processing, wherein the audio 3A processing further comprises: performing background noise suppression on the second audio signal from which the second playing sound has been eliminated; and performing automatic gain control on the second audio signal after the background noise suppression.
12. The apparatus for training an audio processing model according to any one of claims 8-10, wherein:
the encoding submodel, the coupling submodel and the decoding submodel each comprise a deep learning module, and the deep learning module comprises at least one convolutional neural network and at least one deep neural network; or
the encoding submodel and the coupling submodel comprise encoders in a Transformer model, and the decoding submodel comprises decoders in the Transformer model.
13. An audio processing apparatus, comprising:
an obtaining module configured to obtain a trained audio processing model, wherein the trained audio processing model is trained according to the method for training an audio processing model of any one of claims 1-5;
an input module configured to input a sixth audio signal and a seventh audio signal into the audio processing model, wherein the seventh audio signal comprises: a mixed acquisition signal of a speaker's voice and a third playing sound of the sixth audio signal; and the sixth audio signal is an interfering audio signal of the speaker's voice;
an output module configured to receive, from the audio processing model, an eighth audio signal obtained by performing audio processing on the sixth audio signal and the seventh audio signal.
14. The audio processing apparatus according to claim 13,
the input module is configured to mix and collect, by a microphone of an edge device, the speaker's voice and the third playing sound to obtain the seventh audio signal while the sixth audio signal is played through a speaker of the edge device to generate the third playing sound.
15. An electronic device, comprising:
a memory;
a processor;
wherein the memory stores an application program executable by the processor, the application program causing the processor to perform the method for training an audio processing model according to any one of claims 1-5, or the audio processing method according to any one of claims 6-7.
16. A computer-readable storage medium having computer-readable instructions stored thereon which, when executed by a processor, cause the processor to perform the method for training an audio processing model according to any one of claims 1-5, or the audio processing method according to any one of claims 6-7.
CN202210659913.9A 2022-06-13 2022-06-13 Audio processing model training method and device, audio processing method and device and electronic equipment Active CN114758669B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210659913.9A CN114758669B (en) 2022-06-13 2022-06-13 Audio processing model training method and device, audio processing method and device and electronic equipment


Publications (2)

Publication Number Publication Date
CN114758669A (en) 2022-07-15
CN114758669B (en) 2022-09-02

Family

ID=82336334

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210659913.9A Active CN114758669B (en) 2022-06-13 2022-06-13 Audio processing model training method and device, audio processing method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN114758669B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1822092A (en) * 2006-03-28 2006-08-23 Beijing Vimicro Corporation Method and its device for eliminating background noise in speech input
CN111161752A (en) * 2019-12-31 2020-05-15 Goertek Inc. Echo cancellation method and device
CN112447183A (en) * 2020-11-16 2021-03-05 Beijing Dajia Internet Information Technology Co., Ltd. Training method and device for audio processing model, audio denoising method and device, and electronic equipment
CN112639963A (en) * 2020-03-19 2021-04-09 SZ DJI Technology Co., Ltd. Audio acquisition device, audio receiving device and audio processing method
CN114067108A (en) * 2022-01-13 2022-02-18 Shenzhen MicroBT Electronics Technology Co., Ltd. Target detection method and device based on neural network

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10074380B2 (en) * 2016-08-03 2018-09-11 Apple Inc. System and method for performing speech enhancement using a deep neural network-based signal
KR101934636B1 (en) * 2017-04-14 2019-01-02 Hanyang University Industry-University Cooperation Foundation Method and apparatus for integrating and removing acoustic echo and background noise based on deepening neural network
US11304000B2 (en) * 2017-08-04 2022-04-12 Nippon Telegraph And Telephone Corporation Neural network based signal processing device, neural network based signal processing method, and signal processing program
DE102019200954A1 (en) * 2019-01-25 2020-07-30 Sonova Ag Signal processing device, system and method for processing audio signals


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
NeuralEcho: A Self-Attentive Recurrent Neural Network For Unified Acoustic Echo Suppression And Speech Enhancement; Meng Yu, et al.; arXiv; 2022-03-20; pp. 1-5 *
Echo hiding detection method based on a convolutional neural network framework; Wang Jie, et al.; Journal of Computer Applications; 2020-02-10; Vol. 40, No. 2; pp. 375-380 *
Research on speech noise reduction based on deep convolutional neural networks; Zhang Kunyao; Forensic Science and Technology; 2021-12-31; Vol. 46, No. 5; pp. 457-463 *


Similar Documents

Publication Publication Date Title
CN110970057B (en) Sound processing method, device and equipment
CN113129917A (en) Speech processing method based on scene recognition, and apparatus, medium, and system thereof
CN109994127B (en) Audio detection method and device, electronic equipment and storage medium
CN109887494B (en) Method and apparatus for reconstructing a speech signal
US11587560B2 (en) Voice interaction method, device, apparatus and server
CN109361995B (en) Volume adjusting method and device for electrical equipment, electrical equipment and medium
CN111696570B (en) Voice signal processing method, device, equipment and storage medium
CN111445901A (en) Audio data acquisition method and device, electronic equipment and storage medium
WO2022160715A1 (en) Voice signal processing method and electronic device
US20230260525A1 (en) Transform ambisonic coefficients using an adaptive network for preserving spatial direction
CN110931000A (en) Method and device for speech recognition
CN114299933A (en) Speech recognition model training method, device, equipment, storage medium and product
CN111223475A (en) Voice data generation method and device, electronic equipment and storage medium
CN112233689B (en) Audio noise reduction method, device, equipment and medium
CN112599144B (en) Audio data processing method, audio data processing device, medium and electronic equipment
CN112233688B (en) Audio noise reduction method, device, equipment and medium
CN114758669B (en) Audio processing model training method and device, audio processing method and device and electronic equipment
CN112133319A (en) Audio generation method, device, equipment and storage medium
WO2023124248A1 (en) Voiceprint recognition method and apparatus
CN115312068B (en) Voice control method, equipment and storage medium
CN114283825A (en) Voice processing method and device, electronic equipment and storage medium
CN111508513A (en) Audio processing method and device and computer storage medium
CN112750449A (en) Echo cancellation method, device, terminal, server and storage medium
CN113162837A (en) Voice message processing method, device, equipment and storage medium
CN113823278B (en) Speech recognition method, device, electronic equipment and storage medium

Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant