CN117334212A - Processing method and device and electronic equipment

Processing method and device and electronic equipment

Info

Publication number
CN117334212A
CN117334212A
Authority
CN
China
Prior art keywords
audio
sound
target
processing
target audio
Prior art date
2023-09-28
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311277815.XA
Other languages
Chinese (zh)
Inventor
黄洪舟
肖荣彬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Lenovo Beijing Ltd
Original Assignee
Lenovo Beijing Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2023-09-28
Filing date
2023-09-28
Publication date
2024-01-02
Application filed by Lenovo Beijing Ltd filed Critical Lenovo Beijing Ltd
Priority to CN202311277815.XA
Publication of CN117334212A

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00: Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0272: Voice signal separating

Abstract

The application discloses a processing method, a processing apparatus, and an electronic device. The processing method includes: obtaining target audio, the target audio including a plurality of different sounds; and processing the target audio based on an audio processing engine to generate at least a first audio for audio output and a second audio for audio output. The first audio includes a first sound; the second audio includes at least a second sound and a third sound; the first sound, the second sound, and the third sound belong to the plurality of different sounds included in the target audio and are different from each other.

Description

Processing method and device and electronic equipment
Technical Field
The present disclosure relates to the field of audio processing technologies, and in particular, to a processing method, an apparatus, and an electronic device.
Background
Currently, users need to extract the speech that is useful to them in noisy environments. Speech separation is typically accomplished using computational auditory scene analysis (CASA, Computational Auditory Scene Analysis) techniques.
However, in real application scenarios, some of the individual sounds may fail to be separated, reducing the reliability of sound separation.
Disclosure of Invention
In view of this, the present application provides a processing method, apparatus and electronic device, as follows:
a method of processing, the method comprising:
obtaining target audio, the target audio comprising a plurality of different sounds;
processing the target audio based on an audio processing engine, generating at least a first audio for audio output and a second audio for audio output;
the first audio comprises a first sound, the second audio at least comprises a second sound and a third sound, the first sound, the second sound and the third sound belong to a plurality of different sounds included in the target audio, and the first sound, the second sound and the third sound are different from each other.
In the above method, preferably, the plurality of different sounds included in the target audio satisfy at least one of the following:
based on a time dimension, the first sound overlaps the second sound;
based on a time dimension, the first sound overlaps the third sound;
based on a time dimension, the first sound does not overlap the second sound;
based on a time dimension, the first sound does not overlap with the third sound.
In the above method, preferably, processing the target audio based on the audio processing engine to generate at least a first audio for audio output and a second audio for audio output includes:
processing the target audio based on a first model to generate the first audio;
the target audio is processed based on the first audio, and the second audio is generated.
In the above method, preferably, processing the target audio based on the first audio to generate the second audio includes:
processing the first audio and the target audio based on a second model, where the second model is complementary to the first model: the first model is used to determine the first audio, and the second model is used to eliminate the first audio.
In the above method, preferably, processing the target audio based on the first model to generate the first audio includes:
processing the target audio based on a plurality of filtering modules, wherein each filtering module corresponds to one sub-waveform, and each filtering module comprises audio parameters corresponding to the sub-waveform;
if exactly one filtering module outputs sub-waveform audio after the plurality of filtering modules process the target audio, taking that sub-waveform audio as the first audio.
In the above method, preferably, processing the target audio based on the first model to generate the first audio includes:
processing the target audio based on a plurality of filtering modules, wherein each filtering module corresponds to one sub-waveform, and each filtering module comprises audio parameters corresponding to the sub-waveform;
if at least two filtering modules output sub-waveform audio after the plurality of filtering modules process the target audio, processing the audio parameters of those filtering modules based on a result model to obtain a target filtering module;
and processing the target audio based on the target filtering module to obtain the first audio.
In the above method, preferably, the method further includes:
if the first audio is output, obtaining first feedback information for the first audio and correcting the result model based on the first feedback information; or,
obtaining second feedback information for the second audio if the second audio is output; correcting the result model based on the second feedback information.
In the above method, preferably, obtaining the target audio includes:
picking up a first acquisition signal in a first range corresponding to a first beam based on the first beam, wherein the first acquisition signal represents sound in the first range, and the first acquisition signal corresponds to a first sound acquisition device;
picking up a second acquisition signal in a second range corresponding to a second beam based on the second beam, the second acquisition signal characterizing sound in the second range, the second acquisition signal corresponding to a second sound acquisition device;
picking up a third acquisition signal corresponding to a position of a third beam based on the third beam, wherein the third acquisition signal corresponds to a third sound acquisition device, the position corresponding to the third beam belongs to a third range, and the third range is determined by the first sound acquisition device, the second sound acquisition device and the third sound acquisition device;
generating the target audio based on the first acquisition signal, the second acquisition signal, and the third acquisition signal.
A processing apparatus, comprising:
an audio obtaining unit configured to obtain a target audio including a plurality of different sounds;
an engine processing unit for processing the target audio based on an audio processing engine, generating at least a first audio for audio output and a second audio for audio output;
the first audio comprises a first sound, the second audio at least comprises a second sound and a third sound, the first sound, the second sound and the third sound belong to a plurality of different sounds included in the target audio, and the first sound, the second sound and the third sound are different from each other.
An electronic device, comprising:
a memory for storing a computer program and data generated by the operation of the computer program;
a processor for executing the computer program to implement: obtaining target audio, the target audio comprising a plurality of different sounds; processing the target audio based on an audio processing engine, generating at least a first audio for audio output and a second audio for audio output; the first audio comprises a first sound, the second audio at least comprises a second sound and a third sound, the first sound, the second sound and the third sound belong to a plurality of different sounds included in the target audio, and the first sound, the second sound and the third sound are different from each other.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a processing method according to a first embodiment of the present disclosure;
FIG. 2 is an exemplary diagram of sound 1, sound 2, and sound 3 in an embodiment of the present application;
FIG. 3 is a partial flow chart of a processing method according to a first embodiment of the present disclosure;
FIGS. 4 and 5 are respectively exemplary diagrams for obtaining a first audio in an embodiment of the present application;
FIG. 6 is an exemplary diagram of obtaining second audio in an embodiment of the present application;
fig. 7 is an exemplary diagram of obtaining feedback information in an embodiment of the present application;
fig. 8 is a schematic structural diagram of a processing device according to a second embodiment of the present disclosure;
fig. 9 is another schematic structural diagram of a processing device according to a second embodiment of the present disclosure;
fig. 10 is a schematic structural diagram of an electronic device according to a third embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
Referring to fig. 1, a flowchart of a processing method according to an embodiment of the present application is shown, and the method may be applied to an electronic device capable of performing data processing, such as a computer or a server. The technical scheme in the embodiment is mainly used for realizing sound separation so as to improve the reliability of the sound separation.
Specifically, the method in this embodiment may include the following steps:
step 101: target audio is obtained, the target audio comprising a plurality of different sounds.
The target audio may be audio containing a plurality of different sounds collected in an outdoor environment or an indoor space. These different sounds come from different sound sources, such as a person or a musical instrument, respectively.
For example, in a conference room, a participant records audio in the conference room with a microphone on the phone, including speech audio of the participant and music audio being played.
For another example, on an outdoor court, a worker uses a video recorder to record a ball game while a commentator narrates it, with the commentary picked up through a microphone. The recorded audio includes the commentator's commentary, the other spectators' cheering, and the music being played.
Step 102: based on the audio processing engine processing the target audio, at least a first audio for audio output and a second audio for audio output are generated.
The first audio comprises a first sound, the second audio at least comprises a second sound and a third sound, the first sound, the second sound and the third sound belong to a plurality of different sounds contained in the target audio, and the first sound, the second sound and the third sound are different from each other.
The plurality of sounds included in the target audio may or may not overlap in the time dimension. Specifically, the plurality of different sounds included in the target audio satisfy at least one of the following:
based on the time dimension, the first sound overlaps the second sound;
based on the time dimension, the first sound overlaps with the third sound;
based on the time dimension, the first sound does not overlap the second sound;
based on the time dimension, the first sound does not overlap with the third sound.
For example, as shown in fig. 2, the target audio contains sound 1 as a main sound and sounds 2 and 3 as background sounds. In the time dimension, sound 1 and sound 2 have both overlapping and non-overlapping signal portions, and sound 1 and sound 3 likewise have both overlapping and non-overlapping signal portions.
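As a small illustration of these overlap relations, each sound can be summarized by its active interval in seconds; the intervals below are made-up values, not from the patent:

def overlaps(a, b):
    # Two intervals (start, end) overlap iff each starts before the other ends.
    return a[0] < b[1] and b[0] < a[1]

sound1, sound2, sound3 = (0.0, 4.0), (2.0, 6.0), (5.0, 8.0)
print(overlaps(sound1, sound2))  # True: an overlapping signal portion exists
print(overlaps(sound1, sound3))  # False: no overlap in the time dimension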
In particular, the audio processing engine is a program module capable of implementing sound separation, such as a machine learning model (e.g., a deep neural network (DNN) model) or a filtering model.
As can be seen from the foregoing technical solutions, the processing method according to the first embodiment of the present application, after obtaining a target audio containing a plurality of different sounds, may process the target audio based on an audio processing engine to generate at least a first audio for audio output and a second audio for audio output, where the first audio includes a first sound, the second audio includes at least a second sound and a third sound, the three sounds belong to the plurality of different sounds included in the target audio, and the three sounds are different from each other. This embodiment thus processes the target audio using the audio processing engine to separate the plurality of different sounds it contains, thereby improving the reliability of sound separation.
In one implementation, the target audio may be processed in step 102 through the following steps:
step 301: the target audio is processed based on the first model, generating first audio.
The first model includes at least a plurality of filtering modules, and each filtering module includes an audio parameter for obtaining its corresponding sub-waveform. The audio parameter may be, for example, the audio mask or the audio frequency corresponding to the sub-waveform; the audio mask can filter the corresponding sub-waveform out of the target audio, so that after the target audio passes through the filtering module, the module outputs the sub-waveform audio corresponding to that sub-waveform.
Specifically, in step 301, the target audio may be processed based on the plurality of filtering modules in the first model, and the first audio is then obtained from the one or more sub-waveform audios output by those filtering modules.
In this embodiment, the target audio may be input to all filtering modules at the same time, so that each filtering module processes the target audio according to its own audio parameter; if the target audio contains a sound component matching a module's audio parameter, that module outputs the sub-waveform audio corresponding to the parameter.
Alternatively, in this embodiment, the target audio may be input to the filtering modules sequentially, so that each filtering module in turn processes the target audio according to its own audio parameter; again, if the target audio contains a sound component matching a module's audio parameter, that module outputs the sub-waveform audio corresponding to the parameter.
In one implementation, the target audio may be processed in step 301 based on the plurality of filtering modules in the first model, and if exactly one filtering module outputs sub-waveform audio, that sub-waveform audio is taken as the first audio.
For example, as shown in fig. 4, the target audio is input to the first model, and among the filtering modules A, B, C, and D only module C outputs its sub-waveform audio c; the output sub-waveform audio c is then taken as the first sound, i.e., the first audio.
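A minimal sketch of this single-module case follows, assuming each filtering module holds a per-frequency mask (its audio parameter) that is multiplied with the target audio's spectrum; the class name, sample rate, and energy threshold are illustrative assumptions, not from the patent:

import numpy as np
from scipy.signal import stft, istft

FS = 16000                 # sample rate (assumption)
ENERGY_THRESHOLD = 1e-6    # minimum mean-square energy for a module to count as active

class FilterModule:
    def __init__(self, mask):
        self.mask = mask   # per-frequency gain in [0, 1], shape (257,) for nperseg=512

    def process(self, audio):
        # Multiply the mask with the target audio in the time-frequency domain,
        # then resynthesize the corresponding sub-waveform audio.
        _, _, Z = stft(audio, fs=FS, nperseg=512)
        _, out = istft(Z * self.mask[:, None], fs=FS, nperseg=512)
        return out

def run_first_model(modules, target_audio):
    # Feed the target audio to every module in parallel; a module is treated as
    # having output sub-waveform audio only if enough signal survives its mask.
    outputs = [(m, m.process(target_audio)) for m in modules]
    return [(m, out) for m, out in outputs if np.mean(out ** 2) > ENERGY_THRESHOLD]

# If run_first_model returns a single entry, that entry's audio is the first audio.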
Taking a ball game commentary scene as an example: the high-volume main sound c in the scene also contains part of a human voice b. When module C outputs its presumed main sound c' (which still contains part of voice b) and module B outputs its presumed voice b, mask 1 needs to strip sound b out of c', leaving the main sound c.
In another implementation, the target audio may be processed in step 301 based on the plurality of filtering modules in the first model; if at least two filtering modules output sub-waveform audio, the audio parameters of those filtering modules are processed based on a result model to obtain a target filtering module, and the target audio is finally processed based on the target filtering module to obtain the first audio.
The result model is the model in the first model used to obtain the target filtering module. The result model may be an attention-based machine learning model trained in advance on training samples, so that it can process multiple audio parameters into a new audio parameter, which is used to construct a new filtering module.
Specifically, a training sample of the result model includes: multiple audio parameters as the input sample and the corresponding new audio parameter as the output sample.
For example, as shown in fig. 5, the target audio is input to the first model, and among the filtering modules A, B, C, and D, modules B and C output sub-waveform audios b and c respectively. The result model then processes the audio parameters of modules B and C and outputs a new audio parameter, such as the audio mask corresponding to a new sub-waveform; a new filtering module, i.e., the target filtering module E, is constructed from the new audio parameter, and the target audio is finally reprocessed with module E to obtain the sub-waveform audio e filtered by it. The output sub-waveform audio e is taken as the first sound, i.e., the first audio.
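A sketch of this fusion step, reusing the FilterModule class from the sketch above. The patent describes the result model as a learned, attention-based model; the hand-written softmax weighting below is only a stand-in for illustration:

import numpy as np

def result_model(masks):
    # masks: per-frequency masks of the filtering modules that output audio.
    stacked = np.stack(masks)                      # (n_modules, n_freq_bins)
    scores = stacked.mean(axis=1)                  # crude per-mask salience score
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()              # softmax over the modules
    new_mask = (weights[:, None] * stacked).sum(axis=0)
    return np.clip(new_mask, 0.0, 1.0)

# target_module = FilterModule(result_model([module_b.mask, module_c.mask]))
# first_audio = target_module.process(target_audio)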
Taking a ball game commentary scene as an example: when the commentary audio is noisy and two people commentate at the same time, sound b, sound c, and background sound coexist. The module for sound b outputs a presumed main sound b' (containing part of voice b and background sound), and the module for sound c outputs a presumed main sound c' (containing background voices). A new mask 2 is constructed from the masks of b' and c', and the new mask 2 processes the commentary audio again, leaving the main sound b.
It should be noted that, in this embodiment, the audio parameters of each filtering module may be obtained as follows:
First, a multi-scale feature sequence is extracted from the specific audio corresponding to the audio parameter, such as the sound feature sequence of the specific audio at each frequency; a parameter-acquisition model then processes the sound feature sequence at each frequency to obtain the audio parameter, such as the audio mask, at each frequency. A corresponding filtering module is then constructed from each audio parameter.
Specifically, in this embodiment, when the target audio is processed based on the plurality of filtering modules in the first model, the audio parameter of each filtering module, such as its audio mask, may be multiplied with the amplitude of the target audio; in this way, when the target audio contains the sound signal corresponding to a module's audio parameter, that module outputs the sub-waveform audio corresponding to its audio parameter.
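A minimal sketch of deriving such an audio parameter from specific audio, assuming a multi-scale magnitude-feature pipeline; the patent's parameter-acquisition model is learned (attention-based), so the normalized average spectrum below is only a stand-in:

import numpy as np
from scipy.signal import stft

def derive_mask(specific_audio, fs=16000, scales=(256, 512, 1024), n_bins=257):
    per_scale = []
    for nperseg in scales:
        _, _, Z = stft(specific_audio, fs=fs, nperseg=nperseg)
        mag = np.abs(Z).mean(axis=1)               # average magnitude per frequency
        # Resample each scale onto a common frequency grid so they can be fused.
        common = np.interp(np.linspace(0.0, 1.0, n_bins),
                           np.linspace(0.0, 1.0, mag.size), mag)
        per_scale.append(common)
    fused = np.mean(per_scale, axis=0)
    return fused / (fused.max() + 1e-9)            # per-frequency mask in [0, 1]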
Step 302: a second audio is generated based on the first audio processing target audio.
In one implementation, the first audio may be filtered out of the target audio by a filtering algorithm to obtain the second audio; in this case the first audio is treated as noise in the target audio.
In another implementation, the first audio and the target audio may be processed in step 302 based on a second model in the audio processing engine to generate the second audio, where the second model is complementary to the first model: the first model is used to determine the first audio, and the second model is used to eliminate it.
For example, the second model may be a noise-reduction model, a DNN-based model, or the like, configured to filter the first audio out of the target audio to obtain the remaining audio, i.e., the second audio, thereby improving the audio quality of the second audio. Specifically, a training sample of the second model includes: one audio together with a mixture of multiple audios as the input sample, and the remaining audios of the mixture as the output sample.
As shown in fig. 6, the first audio is determined from the target audio using the first model in the audio processing engine, and then the first audio is removed from the target audio using the second model to output the second audio.
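A sketch of this complementary step, assuming the second model may be stood in for by spectral subtraction that treats the first audio as noise; the patent also allows a learned (e.g., DNN-based) noise-reduction model here instead:

import numpy as np
from scipy.signal import stft, istft

def second_model(target_audio, first_audio, fs=16000):
    _, _, Z_t = stft(target_audio, fs=fs, nperseg=512)
    _, _, Z_f = stft(first_audio, fs=fs, nperseg=512)
    n = min(Z_t.shape[1], Z_f.shape[1])
    # Subtract the first audio's magnitude from the target's and keep the
    # target's phase, so everything except the first audio survives.
    mag = np.maximum(np.abs(Z_t[:, :n]) - np.abs(Z_f[:, :n]), 0.0)
    phase = np.angle(Z_t[:, :n])
    _, second_audio = istft(mag * np.exp(1j * phase), fs=fs, nperseg=512)
    return second_audio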
Based on the above implementation manner, the method in the present embodiment may further include at least one of the following processes:
if the first audio is output, obtaining first feedback information aiming at the first audio, and correcting a result model based on the first feedback information;
if the second audio is output, second feedback information for the second audio is obtained, and then the result model is corrected based on the second feedback information.
The first feedback information may be obtained according to a feedback input operation for the first audio. The first feedback information characterizes whether the first sound in the first audio meets the output requirements, such as preserving the main sound most completely, keeping it closest to the original sound, or making the main sound more intelligible. For example, the first feedback information may be a numerical value: the higher the value, the better the first sound meets the requirements; the lower the value, the worse it meets them.
In particular, the feedback input operation may be obtained from a feedback input interface for the first audio. For example, as shown in fig. 7, a feedback control for entering or selecting a numerical value is shown on the feedback input interface; when the user performs a feedback input operation for the first audio on this interface, such as entering or selecting a value, the first feedback information for the first audio is obtained.
In this embodiment, correcting the result model based on the first feedback information may specifically mean: correcting the model parameters of the result model, so that the result model outputs new audio parameters for the audio parameters of the filtering modules that output sub-waveform audio and a new filtering module is constructed from them; the audio processing engine with the optimized result model can thus obtain a first audio that better meets the requirements.
The second feedback information may likewise be obtained according to a feedback input operation for the second audio. The second feedback information characterizes whether the second and third sounds in the second audio meet the output requirements, such as keeping the background sounds closest to the original or making their content more intelligible. For example, the second feedback information may be a numerical value: the higher the value, the better the second and third sounds meet the requirements; the lower the value, the worse they meet them.
In particular, the feedback input operation may be obtained from a feedback input interface for the second audio. For example, a feedback control for entering or selecting a numerical value is shown on the feedback input interface; when the user performs a feedback input operation for the second audio on this interface, such as entering or selecting a value, the second feedback information for the second audio is obtained.
In this embodiment, correcting the result model based on the second feedback information may specifically mean: correcting the model parameters of the result model, so that the result model outputs new audio parameters for the audio parameters of the filtering modules that output sub-waveform audio and a new filtering module is constructed from them; the audio processing engine with the optimized result model can thus obtain a second audio that better meets the requirements.
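As a sketch of this reward-style correction, assume the result model exposes a tunable parameter (here a single mask-sharpness exponent, applied as mask ** sharpness when building the new filtering module) and the feedback is a score on a 0-10 scale; both the parameter and the update rule are illustrative assumptions, not the patent's training procedure:

def correct_result_model(params, score, target_score=8.0, lr=0.05):
    # params: result-model parameters, e.g. {"sharpness": 1.0}.
    # A score below the target makes the fused mask more selective;
    # a score at or above the target leaves the parameters nearly unchanged.
    error = (target_score - score) / target_score
    params["sharpness"] = max(0.1, params["sharpness"] + lr * error)
    return params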
In one implementation, when the target audio is obtained in step 101, this may be achieved by:
based on the first beam, picking up a first acquisition signal in a first range corresponding to the first beam, the first acquisition signal characterizing sound in the first range, the first acquisition signal corresponding to the first sound acquisition device.
Picking up a second acquisition signal in a second range corresponding to the second beam based on the second beam, the second acquisition signal representing sound in the second range, the second acquisition signal corresponding to the second sound acquisition device;
picking up a third acquisition signal corresponding to a position of the third beam based on the third beam, wherein the third acquisition signal corresponds to a third sound acquisition device, the position corresponding to the third beam belongs to a third range, and the third range is determined by the first sound acquisition device, the second sound acquisition device and the third sound acquisition device;
generating the target audio based on the first acquisition signal, the second acquisition signal, and the third acquisition signal.
For example, the first, second, and third sound acquisition devices may each be a device capable of collecting sound, such as a microphone.
Taking a mobile phone as the electronic device: the first acquisition device is the mic on the top of the phone, the second acquisition device is the mic on the bottom, and the third acquisition device is the mic on the back. The first and second beams are cardioid beams; the first range is the pickup range of the top mic's cardioid beam, and the second range is the pickup range of the bottom mic's cardioid beam. The third range is the beamformed pickup range of an approximately 30-degree region directly in front of the phone, formed by the top, bottom, and back mics together. The position corresponding to the third beam is the sound-producing position of a sound signal; the third range contains that position and changes as the position moves. On this basis, the three mics on the phone collect three acquisition signals respectively, from which the target audio is generated.
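A minimal sketch of this three-mic pickup, assuming synchronized mic signals, a first-order differential stand-in for the cardioid beams, and a delay-and-sum stand-in for the front beam; how the three acquisition signals are combined into the target audio is not specified in the patent, so the simple average at the end is an assumption:

import numpy as np

def cardioid_beam(front_mic, rear_mic, delay_samples=4):
    # First-order differential beam: subtract a delayed copy of the rear mic.
    rear_delayed = np.concatenate([np.zeros(delay_samples),
                                   rear_mic[:-delay_samples]])
    return front_mic - rear_delayed

def front_beam(top_mic, bottom_mic, back_mic):
    # Delay-and-sum toward the front of the phone (zero steering delay assumed).
    return (top_mic + bottom_mic + back_mic) / 3.0

def build_target_audio(top_mic, bottom_mic, back_mic):
    sig1 = cardioid_beam(top_mic, back_mic)            # first acquisition signal
    sig2 = cardioid_beam(bottom_mic, back_mic)         # second acquisition signal
    sig3 = front_beam(top_mic, bottom_mic, back_mic)   # third acquisition signal
    n = min(len(sig1), len(sig2), len(sig3))
    return (sig1[:n] + sig2[:n] + sig3[:n]) / 3.0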
Referring to fig. 8, a schematic structural diagram of a processing apparatus according to a second embodiment of the present application may be configured in an electronic device capable of performing data processing, such as a computer or a server. The technical scheme in the embodiment is mainly used for realizing sound separation so as to improve the reliability of the sound separation.
Specifically, the apparatus in this embodiment may include the following units:
an audio obtaining unit 801 for obtaining a target audio including a plurality of different sounds;
an engine processing unit 802 for processing the target audio based on the audio processing engine, generating at least a first audio for audio output and a second audio for audio output;
the first audio comprises a first sound, the second audio at least comprises a second sound and a third sound, the first sound, the second sound and the third sound belong to a plurality of different sounds included in the target audio, and the first sound, the second sound and the third sound are different from each other.
As can be seen from the foregoing technical solution, the processing apparatus according to the second embodiment of the present application, after obtaining a target audio containing a plurality of different sounds, may process the target audio based on an audio processing engine to generate at least a first audio for audio output and a second audio for audio output, where the first audio includes a first sound, the second audio includes at least a second sound and a third sound, the three sounds belong to the plurality of different sounds included in the target audio, and the three sounds are different from each other. This embodiment thus processes the target audio using the audio processing engine to separate the plurality of different sounds it contains, thereby improving the reliability of sound separation.
In one implementation, the plurality of different sounds included in the target audio satisfy at least one of the following:
based on a time dimension, the first sound overlaps the second sound;
based on a time dimension, the first sound overlaps the third sound;
based on a time dimension, the first sound does not overlap the second sound;
based on a time dimension, the first sound does not overlap with the third sound.
In one implementation, the engine processing unit 802 is specifically configured to: process the target audio based on a first model to generate the first audio; and process the target audio based on the first audio to generate the second audio.
In one implementation, when processing the target audio based on the first audio to generate the second audio, the engine processing unit 802 is specifically configured to: process the first audio and the target audio based on a second model, where the second model is complementary to the first model: the first model is used to determine the first audio, and the second model is used to eliminate it.
In one implementation, when processing the target audio based on the first model to generate the first audio, the engine processing unit 802 is specifically configured to: process the target audio based on a plurality of filtering modules, where each filtering module corresponds to one sub-waveform and includes the audio parameter corresponding to that sub-waveform; and, if exactly one filtering module outputs sub-waveform audio after the plurality of filtering modules process the target audio, take that sub-waveform audio as the first audio.
In one implementation, when processing the target audio based on the first model to generate the first audio, the engine processing unit 802 is specifically configured to: process the target audio based on a plurality of filtering modules, where each filtering module corresponds to one sub-waveform and includes the audio parameter corresponding to that sub-waveform; if at least two filtering modules output sub-waveform audio, process the audio parameters of those filtering modules based on the result model to obtain a target filtering module; and process the target audio based on the target filtering module to obtain the first audio.
In one implementation, the apparatus in this embodiment may further include the following units, as shown in fig. 9:
a model optimizing unit 803 for obtaining first feedback information for the first audio if the first audio is output and correcting the result model based on the first feedback information; or, obtaining second feedback information for the second audio if the second audio is output and correcting the result model based on the second feedback information.
In one implementation, the audio obtaining unit 801 is specifically configured to: pick up, based on a first beam, a first acquisition signal in a first range corresponding to the first beam, where the first acquisition signal characterizes sound in the first range and corresponds to a first sound acquisition device; pick up, based on a second beam, a second acquisition signal in a second range corresponding to the second beam, where the second acquisition signal characterizes sound in the second range and corresponds to a second sound acquisition device; pick up, based on a third beam, a third acquisition signal corresponding to the position of the third beam, where the third acquisition signal corresponds to a third sound acquisition device, the position corresponding to the third beam belongs to a third range, and the third range is determined by the first, second, and third sound acquisition devices; and generate the target audio based on the first acquisition signal, the second acquisition signal, and the third acquisition signal.
It should be noted that, the specific implementation of each unit in this embodiment may refer to the corresponding content in the foregoing, which is not described in detail herein.
Referring to fig. 10, a schematic structural diagram of an electronic device according to a third embodiment of the present application may include the following structures:
a memory 1001 for storing a computer program and data generated by the operation of the computer program;
a processor 1002 for executing the computer program to implement: obtaining target audio, the target audio comprising a plurality of different sounds; processing the target audio based on an audio processing engine, generating at least a first audio for audio output and a second audio for audio output; the first audio comprises a first sound, the second audio at least comprises a second sound and a third sound, the first sound, the second sound and the third sound belong to a plurality of different sounds included in the target audio, and the first sound, the second sound and the third sound are different from each other.
As can be seen from the foregoing technical solution, the electronic device according to the third embodiment of the present application, after obtaining a target audio containing a plurality of different sounds, may process the target audio based on an audio processing engine to generate at least a first audio for audio output and a second audio for audio output, where the first audio includes a first sound, the second audio includes at least a second sound and a third sound, the three sounds belong to the plurality of different sounds included in the target audio, and the three sounds are different from each other. This embodiment thus processes the target audio using the audio processing engine to separate the plurality of different sounds it contains, thereby improving the reliability of sound separation.
Taking the "cocktail problem" as an example, the cocktail problem is the pain and difficulty in the field of speech recognition, people talk in the cocktail, the speech signals overlap, and the machine needs to separate them into separate signals.
In a mobile phone audio/video recording scenario (three mics), the application realizes sound separation through the following scheme:
(1) Use the top and bottom mics of the phone to construct fixed dual cardioid beams for sound pickup.
(2) Use the top and bottom mics together with the back mic to construct a beam covering roughly a 30° pickup region directly in front of the phone.
(3) Use the sound signals picked up by the left and right cardioid beams and by the front beam as the input target audio.
(4) Input the target audio to the audio processing engine of the application to obtain the first sound, the second sound, and the third sound, e.g., separating the voice output for the main-beam sound object from the other sounds output for the background sound objects.
The audio processing engine processes the target audio through the filtering modules, each holding an audio mask (audio parameter) for one sound, to determine the first sound, and then obtains the second and third sounds using the first sound and the target audio.
(5) Set up a reward mechanism: obtain scoring values for the output main-beam voice and the other sounds, then adaptively adjust the audio masks according to the scores.
It should be noted that the audio parameters of each filtering module for the mobile phone recording scenario in the audio processing engine are obtained as follows:
(1) First, obtain specific audio, e.g., collect audio from multiple phone audio/video recording scenes through the phone's three mics; then obtain the sound information sequence of the specific audio, extract multi-scale sound features from it, and construct a multi-scale sound feature sequence;
(2) Apply an attention mechanism to the features in the multi-scale sound feature sequence to obtain the voice mask information of the main sounding object and the background-sound mask information (the background sound may also be human voice); these are the audio parameters of the filtering modules.
In a ball game commentary scenario, the application achieves sound separation as follows:
(1) The commentator's voice information and the other sounds in the commentary scene are obtained through the audio parameters of the filtering modules in the audio processing engine;
(2) Because the commentary scene requires a relatively high signal-to-noise ratio, the desired signal should not contain the cheering of the venue. For this purpose, the application sets up a reward mechanism: scoring values are obtained for the output main-beam voice and the other sounds, a new desired mask is adaptively generated according to the scores, and the commentator's voice and the other sounds in the commentary scene are obtained again from the new desired mask and the audio of the scene, so that the separated sounds meet user expectations.
It should be noted that the audio parameters of each filtering module for the ball game commentary scenario in the audio processing engine are obtained as follows:
(1) First, obtain specific audio, e.g., collect audio from multiple ball game commentary scenes (multi-channel commentary voices) through a camera, and construct a multi-scale feature sequence with the multi-channel commentary voices as input;
(2) Apply an attention mechanism to the features in the multi-scale feature sequence to obtain the voice mask information of the main sounding object and the background-sound mask information (the background sound may also be human voice); these are the audio parameters of the filtering modules.
The effects of the present application are exemplified below:
(1) A video of a basketball or football game is input into the audio processing engine, which separates the on-site sound from the commentator's voice; further, the user can boost or cut the commentator or the on-site sound through the operation interface.
(2) A two-channel recording in which two people speak simultaneously is input into the audio processing engine, which separates the two voices and outputs them on the left and right channels respectively; the user can attenuate one voice and amplify the other on the demonstration interface.
(3) A piece of music is input into the audio processing engine, which can separate the singing voice from the accompaniment; further, the user can boost or cut the vocals and the accompaniment, e.g., to produce accompaniment and non-accompaniment modes.
In the present specification, the embodiments are described in a progressive manner; each embodiment focuses on its differences from the others, and identical or similar parts among the embodiments may be referred to one another. Since the apparatus disclosed in an embodiment corresponds to the method disclosed in that embodiment, its description is relatively brief; for the relevant points, refer to the description of the method.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps are described above generally in terms of functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. The software modules may reside in Random Access Memory (RAM), memory, Read-Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method of processing, the method comprising:
obtaining target audio, the target audio comprising a plurality of different sounds;
processing the target audio based on an audio processing engine, generating at least a first audio for audio output and a second audio for audio output;
the first audio comprises a first sound, the second audio at least comprises a second sound and a third sound, the first sound, the second sound and the third sound belong to a plurality of different sounds included in the target audio, and the first sound, the second sound and the third sound are different from each other.
2. The method of claim 1, wherein the plurality of different sounds included in the target audio satisfy at least one of the following:
based on a time dimension, the first sound overlaps the second sound;
based on a time dimension, the first sound overlaps the third sound;
based on a time dimension, the first sound does not overlap the second sound;
based on a time dimension, the first sound does not overlap with the third sound.
3. The method of claim 2, wherein processing the target audio based on the audio processing engine to generate at least a first audio for audio output and a second audio for audio output comprises:
processing the target audio based on a first model to generate the first audio;
processing the target audio based on the first audio to generate the second audio.
4. The method of claim 3, wherein processing the target audio based on the first audio to generate the second audio comprises:
processing the first audio and the target audio based on a second model, wherein the second model is complementary to the first model, the first model is used to determine the first audio, and the second model is used to eliminate the first audio.
5. The method of claim 3, wherein processing the target audio based on the first model to generate the first audio comprises:
processing the target audio based on a plurality of filtering modules, wherein each filtering module corresponds to one sub-waveform, and each filtering module comprises audio parameters corresponding to the sub-waveform;
if exactly one filtering module outputs sub-waveform audio after the plurality of filtering modules process the target audio, taking that sub-waveform audio as the first audio.
6. The method of claim 3, wherein processing the target audio based on the first model to generate the first audio comprises:
processing the target audio based on a plurality of filtering modules, wherein each filtering module corresponds to one sub-waveform, and each filtering module comprises audio parameters corresponding to the sub-waveform;
if at least two filtering modules output sub-waveform audio after the plurality of filtering modules process the target audio, processing the audio parameters of those filtering modules based on a result model to obtain a target filtering module;
and processing the target audio based on the target filtering module to obtain the first audio.
7. The method of claim 5 or 6, the method further comprising:
if the first audio is output, obtaining first feedback information for the first audio and correcting the result model based on the first feedback information; or,
obtaining second feedback information for the second audio if the second audio is output; correcting the result model based on the second feedback information.
8. The method of claim 1, wherein obtaining the target audio comprises:
picking up a first acquisition signal in a first range corresponding to a first beam based on the first beam, wherein the first acquisition signal represents sound in the first range, and the first acquisition signal corresponds to a first sound acquisition device;
picking up a second acquisition signal in a second range corresponding to a second beam based on the second beam, the second acquisition signal characterizing sound in the second range, the second acquisition signal corresponding to a second sound acquisition device;
picking up a third acquisition signal corresponding to a position of a third beam based on the third beam, wherein the third acquisition signal corresponds to a third sound acquisition device, the position corresponding to the third beam belongs to a third range, and the third range is determined by the first sound acquisition device, the second sound acquisition device and the third sound acquisition device;
generating the target audio based on the first acquisition signal, the second acquisition signal, and the third acquisition signal.
9. A processing apparatus, comprising:
an audio obtaining unit configured to obtain a target audio including a plurality of different sounds;
an engine processing unit for processing the target audio based on an audio processing engine, generating at least a first audio for audio output and a second audio for audio output;
the first audio comprises a first sound, the second audio at least comprises a second sound and a third sound, the first sound, the second sound and the third sound belong to a plurality of different sounds included in the target audio, and the first sound, the second sound and the third sound are different from each other.
10. An electronic device, comprising:
a memory for storing a computer program and data generated by the operation of the computer program;
a processor for executing the computer program to implement: obtaining target audio, the target audio comprising a plurality of different sounds; processing the target audio based on an audio processing engine, generating at least a first audio for audio output and a second audio for audio output; the first audio comprises a first sound, the second audio at least comprises a second sound and a third sound, the first sound, the second sound and the third sound belong to a plurality of different sounds included in the target audio, and the first sound, the second sound and the third sound are different from each other.
CN202311277815.XA 2023-09-28 2023-09-28 Processing method and device and electronic equipment Pending CN117334212A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311277815.XA CN117334212A (en) 2023-09-28 2023-09-28 Processing method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311277815.XA CN117334212A (en) 2023-09-28 2023-09-28 Processing method and device and electronic equipment

Publications (1)

Publication Number Publication Date
CN117334212A true CN117334212A (en) 2024-01-02

Family

ID=89282430

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311277815.XA Pending CN117334212A (en) 2023-09-28 2023-09-28 Processing method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN117334212A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination