CN115641868A - Audio separation method and device, electronic equipment and computer readable storage medium - Google Patents

Audio separation method and device, electronic equipment and computer readable storage medium

Info

Publication number
CN115641868A
CN115641868A
Authority
CN
China
Prior art keywords
audio
complex spectrum
tracks
separation
dimensional
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211100895.7A
Other languages
Chinese (zh)
Inventor
王洋
李晨星
邓峰
王晓瑞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202211100895.7A priority Critical patent/CN115641868A/en
Publication of CN115641868A publication Critical patent/CN115641868A/en
Pending legal-status Critical Current

Landscapes

  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The disclosure relates to an audio separation method, an audio separation device, an electronic device, and a computer-readable storage medium. The audio separation method comprises the following steps: obtaining, for audio to be separated, a mixed audio complex spectrum of the audio to be separated and coarse audio complex spectra of at least two tracks based on a coarse separation network of an audio separation model; obtaining complex spectral residuals of the at least two tracks from the coarse audio complex spectra and the mixed audio complex spectrum based on a residual compensation network of the audio separation model; for each track, determining an audio complex spectrum according to the coarse audio complex spectrum and the complex spectral residual; and respectively converting the audio complex spectra of the at least two tracks into audio signals. Both the coarse separation network and the residual compensation network comprise a two-dimensional window self-attention network, which comprises a multi-head self-attention layer and a two-dimensional window self-attention layer connected in series. According to this scheme, the information required for multi-task audio separation can be comprehensively captured, and audio separation performance is improved.

Description

Audio separation method and device, electronic equipment and computer readable storage medium
Technical Field
The present disclosure relates to the field of audio processing technologies, and in particular, to an audio separation method and apparatus, an electronic device, and a computer-readable storage medium.
Background
The audio signal in real life mainly includes voice, music, background noise and the like. When these signals are mixed, the intelligibility of the audio signal is reduced, which can impair the subsequent audio understanding task. For example, in a music information retrieval task, foreground voice and background noise can reduce the accuracy of retrieval; in the speech recognition task, background noise can reduce the accuracy of recognition, and music in audio can cause confusion of recognition results. Meanwhile, the extracted energy of the background noise is an important evaluation basis in the voice quality evaluation task. Therefore, the multitask audio separation is a technology capable of extracting three tracks (voice, music and noise) of mixed audio at a time, and is suitable for more application scenes compared with a single-task audio separation technology capable of extracting only two tracks.
The objective of the multi-task audio separation is to extract three tracks of mixed audio at a time through one model, while voice, music and noise have different acoustic characteristics, and the voice has short-time stationarity; music has a periodic and rich harmonic structure and high frequency components; the noise is relatively random and has no obvious structural characteristics. Therefore, the model not only needs to have the capability of extracting the global context dependency relationship so as to capture the long-time characteristics such as short-time stationarity and periodicity, but also needs to extract the local context dependency relationship and the similarity between time-frequency units so as to capture the characteristics such as harmonic structure, time-frequency distribution and the like, so that the voice, music and noise can be better distinguished. However, the existing multi-task audio separation method has shortcomings in extracting global dependence, local dependence and similarity between time-frequency units, and the separation performance needs to be improved.
Disclosure of Invention
The present disclosure provides an audio separation method, an audio separation apparatus, an electronic device, and a computer-readable storage medium, so as to at least solve the problem in the related art of how to improve the performance of multi-task audio separation; the present disclosure is not required to solve any of the problems described above.
According to a first aspect of the present disclosure, there is provided an audio separation method including: the method comprises the steps that a mixed audio complex spectrum of audio to be separated and coarse-divided audio complex spectra of at least two tracks are obtained on the basis of a coarse-divided network of an audio separation model; obtaining the complex spectrum residuals of the at least two tracks based on a residual compensation network of the audio separation model for the coarse audio complex spectrum and the mixed audio complex spectrum of the at least two tracks; for each track, determining an audio complex spectrum according to the coarsely divided audio complex spectrum and the complex spectrum residual error; respectively converting the audio complex spectrums of the at least two tracks into audio signals; the rough separation network and the residual error compensation network both comprise a two-dimensional window self-attention network, the two-dimensional window self-attention network comprises a multi-head self-attention layer and a two-dimensional window self-attention layer which are connected in series, the multi-head self-attention layer and the two-dimensional window self-attention layer are respectively used for extracting a first intermediate feature and a second intermediate feature, the first intermediate feature and the second intermediate feature are both two-dimensional matrixes formed by a plurality of time-frequency units, and the two-dimensional window self-attention layer is used for extracting a three-dimensional feature from the first intermediate feature and extracting the second intermediate feature from the three-dimensional feature.
Optionally, the two-dimensional window self-attention layer is configured to perform the following steps on the first intermediate feature: for each time-frequency unit in the first intermediate characteristic, extracting a set number of time-frequency units with a set scale spaced from the current time-frequency unit in the first intermediate characteristic to serve as characteristic vectors of the current time-frequency unit, and converging the characteristic vectors of the time-frequency units in the first intermediate characteristic to form the three-dimensional characteristic; performing windowing processing on the three-dimensional features to obtain a plurality of three-dimensional feature blocks; and extracting two-dimensional features from each three-dimensional feature block based on a multi-head self-attention mechanism, and combining the corresponding two-dimensional features according to the positions of the three-dimensional feature blocks in the three-dimensional features to obtain the second intermediate features.
Optionally, the coarse-divide network comprises a plurality of the two-dimensional window self-attention networks in series.
Optionally, the residual error compensation network includes an encoding network, the two-dimensional window self-attention network, and a decoding network.
Optionally, the obtaining, by the rough separation network based on an audio separation model, rough separation audio complex spectrums and mixed audio complex spectrums of at least two tracks of the audio to be separated includes: separating a mixed audio amplitude spectrum and the mixed audio complex spectrum from the audio to be separated, wherein the mixed audio complex spectrum comprises a mixed audio phase; inputting the mixed audio amplitude spectrum into the rough separation network to obtain rough separation audio amplitude spectrums of the at least two tracks; and for each track, determining the coarse audio complex spectrum according to the coarse audio amplitude spectrum and the mixed audio phase.
Optionally, the obtaining, for the coarsely divided audio complex spectrum and the mixed audio complex spectrum of the at least two tracks, a complex spectrum residual of the at least two tracks based on a residual compensation network of the audio separation model includes: for each track, determining a difference value between the mixed audio complex spectrum and the coarse-divided audio complex spectrum as an initial residual error; and inputting the initial residual error of the at least two tracks and the mixed audio complex spectrum into the residual error compensation network to obtain the complex spectrum residual error of the at least two tracks.
Optionally, the audio separation model is trained by the following steps: obtaining a sample audio, wherein the sample audio is formed by overlapping pure audio signals of at least two tracks, and obtaining a pure audio complex spectrum of each pure audio signal; for the sample audio, obtaining estimated audio complex spectrums and estimated audio signals of the at least two tracks of the sample audio based on an audio separation model to be trained; determining a loss value according to the pure audio complex spectrum, the estimated audio complex spectrum, the pure audio signal and the estimated audio signal of the at least two tracks; and adjusting parameters of the audio separation model to be trained based on the loss value to obtain the audio separation model.
Optionally, the determining a loss value according to the clean audio complex spectrum, the estimated audio complex spectrum, the clean audio signal, and the estimated audio signal of the at least two tracks includes: determining a first loss value according to the pure audio complex spectrum and the pre-estimated audio complex spectrum of each track; determining a second loss value according to the pure audio complex spectrum of each track and the estimated audio complex spectra of other tracks except the current track in the at least two tracks; determining a third loss value according to the pure audio signal and the pre-estimated audio signal of each track; and determining a total loss value according to the first loss value, the second loss value and the third loss value, wherein the total loss value is positively correlated with the first loss value, and the total loss value is negatively correlated with the second loss value and the third loss value.
According to a second aspect of the present disclosure, there is provided an audio separating apparatus comprising: a rough separation unit configured to obtain, for audio to be separated and based on a rough separation network of an audio separation model, a mixed audio complex spectrum of the audio to be separated and rough separation audio complex spectrums of at least two tracks; a residual unit configured to perform a residual compensation network of the audio separation model on the coarsely divided audio complex spectrum and the mixed audio complex spectrum of the at least two tracks to obtain complex spectral residuals of the at least two tracks; a compensation unit configured to perform determining an audio complex spectrum from the coarsely divided audio complex spectrum and the complex spectrum residual for each track; a conversion unit configured to perform a conversion of the audio complex spectra of the at least two tracks into audio signals, respectively; the rough separation network and the residual error compensation network both comprise a two-dimensional window self-attention network, the two-dimensional window self-attention network comprises a multi-head self-attention layer and a two-dimensional window self-attention layer which are connected in series, the multi-head self-attention layer and the two-dimensional window self-attention layer are respectively used for extracting a first intermediate feature and a second intermediate feature, the first intermediate feature and the second intermediate feature are both two-dimensional matrixes formed by a plurality of time-frequency units, and the two-dimensional window self-attention layer is used for extracting a three-dimensional feature from the first intermediate feature and extracting the second intermediate feature from the three-dimensional feature.
Optionally, the two-dimensional window self-attention layer is configured to perform the following steps on the first intermediate feature: for each time-frequency unit in the first intermediate characteristic, extracting a set number of time-frequency units with a set scale spaced from the current time-frequency unit in the first intermediate characteristic to serve as characteristic vectors of the current time-frequency unit, and converging the characteristic vectors of the time-frequency units in the first intermediate characteristic to form the three-dimensional characteristic; performing windowing processing on the three-dimensional features to obtain a plurality of three-dimensional feature blocks; and extracting two-dimensional features from each three-dimensional feature block based on a multi-head self-attention mechanism, and combining the corresponding two-dimensional features according to the positions of the three-dimensional feature blocks in the three-dimensional features to obtain the second intermediate features.
Optionally, the coarse-divide network comprises a plurality of the two-dimensional window self-attention networks in series.
Optionally, the residual error compensation network includes an encoding network, the two-dimensional window self-attention network, and a decoding network.
Optionally, the rough separation unit is further configured to perform separation of a mixed audio magnitude spectrum and the mixed audio complex spectrum from the audio to be separated, where the mixed audio complex spectrum includes a mixed audio phase; inputting the mixed audio amplitude spectrum into the rough separation network to obtain rough separation audio amplitude spectrums of the at least two tracks; and for each track, determining the coarse audio complex spectrum according to the coarse audio amplitude spectrum and the mixed audio phase.
Optionally, the residual unit is further configured to perform determining, for each track, a difference value of the mixed audio complex spectrum and the coarsely divided audio complex spectrum as an initial residual; inputting the initial residuals of the at least two tracks and the mixed audio complex spectrum into the residual compensation network to obtain the complex spectrum residuals of the at least two tracks.
Optionally, the audio separation model is trained by the following steps: obtaining a sample audio, wherein the sample audio is formed by overlapping pure audio signals of at least two tracks, and obtaining a pure audio complex spectrum of each pure audio signal; for the sample audio, obtaining estimated audio complex spectrums and estimated audio signals of the at least two tracks of the sample audio based on an audio separation model to be trained; determining a loss value according to the pure audio complex spectrums, the pre-estimated audio complex spectrums, the pure audio signals and the pre-estimated audio signals of the at least two tracks; and adjusting parameters of the audio separation model to be trained based on the loss value to obtain the audio separation model.
Optionally, the determining a loss value according to the clean audio complex spectrum, the estimated audio complex spectrum, the clean audio signal, and the estimated audio signal of the at least two tracks includes: determining a first loss value according to the pure audio complex spectrum and the pre-estimated audio complex spectrum of each track; determining a second loss value according to the pure audio complex spectrum of each track and the estimated audio complex spectrum of other tracks except the current track in the at least two tracks; determining a third loss value according to the pure audio signal and the estimated audio signal of each track; and determining a total loss value according to the first loss value, the second loss value and the third loss value, wherein the total loss value is positively correlated with the first loss value, and the total loss value is negatively correlated with the second loss value and the third loss value.
According to a third aspect of the present disclosure, there is provided an electronic apparatus comprising: at least one processor; at least one memory storing computer-executable instructions, wherein the computer-executable instructions, when executed by the at least one processor, cause the at least one processor to perform an audio separation method according to the present disclosure.
According to a fourth aspect of the present disclosure, there is provided a computer-readable storage medium in which instructions, when executed by at least one processor, cause the at least one processor to perform an audio separation method according to the present disclosure.
According to a fifth aspect of the present disclosure, there is provided a computer program product comprising computer instructions which, when executed by at least one processor, implement an audio separation method according to the present disclosure.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
according to the audio separation method and the audio separation apparatus of the embodiments of the present disclosure, the most basic network structure of the application is a two-dimensional window self-attention network. The network is able to capture global context dependencies by using a multi-headed self-attention layer. By establishing the two-dimensional window self-attention layer, the two-dimensional first intermediate features formed by the time-frequency units are converted into three-dimensional features, namely, more detailed information is extracted for each time-frequency unit to increase the dimension of the features, so that the obtained three-dimensional features contain richer local information compared with the two-dimensional first intermediate features, and the similarity analysis among the time-frequency units is facilitated. Meanwhile, the two-dimensional window self-attention layer can provide a two-dimensional window for windowing the obtained three-dimensional features and can extract the features based on a self-attention mechanism, so that information required by multi-task audio separation can be comprehensively captured, and the separation performance can be favorably improved. In addition, the exemplary embodiment of the present disclosure further adopts a dual-strategy framework of rough separation and fine adjustment, and after the framework completes the preliminary audio separation, the framework further pre-estimates the residual error of the rough separation result by combining the data before and after the rough separation, and adjusts the rough separation result according to the residual error to obtain the final separation result, which is helpful for improving the audio separation performance. By using the two-dimensional window self-attention network in both coarse separation and fine adjustment, the coarse separation capability and the residual estimation capability in fine adjustment can be further improved, and the audio separation performance can be further improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
Fig. 1 is a schematic structural diagram illustrating a two-dimensional windowed self-attention network in accordance with an exemplary embodiment of the present disclosure;
FIG. 2 is a schematic diagram illustrating a structure of a two-dimensional window self-attention layer according to an exemplary embodiment of the present disclosure;
FIG. 3 is a schematic diagram illustrating the structure of an audio separation model according to an exemplary embodiment of the present disclosure;
fig. 4 is a flowchart illustrating an audio separation method according to an exemplary embodiment of the present disclosure;
fig. 5 is a schematic diagram illustrating a structure of a coarse-divide network according to an exemplary embodiment of the present disclosure;
fig. 6 is a schematic diagram illustrating a structure of a residual compensation network according to an exemplary embodiment of the present disclosure;
fig. 7 is a block diagram illustrating an audio separation apparatus according to an exemplary embodiment of the present disclosure;
fig. 8 is a block diagram of an electronic device according to an example embodiment of the present disclosure.
Detailed Description
In order to make the technical solutions of the present disclosure better understood, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The embodiments described in the following examples do not represent all embodiments consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
In this case, the expression "at least one of the items" in the present disclosure means a case where three types of parallel expressions "any one of the items", "a combination of any plural ones of the items", and "the entirety of the items" are included. For example, "include at least one of a and B" includes the following three cases in parallel: (1) comprises A; (2) comprises B; and (3) comprises A and B. For another example, "at least one of the first step and the second step is performed", which means that the following three cases are juxtaposed: (1) executing the step one; (2) executing the step two; and (3) executing the step one and the step two.
It should be noted that the user information (including, but not limited to, user device information, user personal information, etc.) referred to in the present disclosure is information authorized by the user or sufficiently authorized by each party.
In real life, audio signals mainly include voice, music, background noise and the like. When these signals are mixed, the intelligibility of the audio signal is reduced, which can impair the subsequent audio understanding task. For example, in a music information retrieval task, foreground voice and background noise can reduce the accuracy of retrieval; in the speech recognition task, background noise can reduce the accuracy of recognition, and music in audio can cause confusion of recognition results. Meanwhile, the extracted energy of the background noise is an important evaluation basis in the voice quality evaluation task. Therefore, the multi-task audio separation is a technology capable of extracting three tracks (voice, music and noise) of mixed audio at a time, and is suitable for more application scenes compared with a single-task audio separation technology capable of extracting only two tracks.
In recent years, research efforts have focused primarily on single-task audio separation, such as speech enhancement and separation, music separation, and singing voice separation. These separation algorithms can be classified into frequency-domain, time-domain, and complex-domain separation algorithms. Frequency-domain separation algorithms convert the separation problem into a supervised classification problem: a mask value is obtained as a label by using the masking effect of the audio; a deep learning model then learns a mapping function from the mixed speech to the label; and finally the label information is used to extract the frequency-domain units where the target speech is located. Time-domain separation algorithms use an end-to-end processing mode, taking mixed audio as input and outputting estimated clean audio. Complex-domain separation algorithms are designed to address the phase-mismatch problem of frequency-domain separation algorithms. The separation results of methods such as DCCRN (Deep Complex Convolution Recurrent Network), TSCN (Two-Stage Complex Network), and SDDNet (Simultaneous Denoising and Dereverberation Network) indicate that complex-domain models are more suitable for enhancing speech from background noise.
Multi-task audio separation has been proposed and has attracted wide attention from researchers. Current multi-task audio separation models mainly include the Complex-MTASSNet (Complex-domain Multi-Task Audio Source Separation Network), the MRX (Multi-Resolution Cross Network), and the EAD-Conformer (Encoder-Attention-Decoder network based on the Conformer). The Complex-MTASSNet and MRX models verify the feasibility of multi-task audio separation, and outperform mainstream single-task audio separation models such as GCRN (Gated Convolutional Recurrent Network), Conv-TasNet (fully-convolutional Time-domain Audio Separation Network), D3Net (Densely connected multi-dilated DenseNet), and Demucs (a music source separation network). The EAD-Conformer draws on the Conformer (convolution-augmented Transformer), which achieved SOTA (State-Of-The-Art, optimal) performance in the speech recognition task, to extract audio features for separation, thereby greatly improving separation performance. Meanwhile, the Swin-Transformer (Shifted Windows Transformer) obtains better global and local feature capture capability by introducing strategies such as windowing, and its SOTA performance in audio event detection and classification tasks demonstrates the effectiveness of the Swin-Transformer structure for audio feature extraction.
The objective of multi-task audio separation is to extract the three tracks of mixed audio at a time with one model, while voice, music, and noise have different acoustic characteristics: voice has short-time stationarity; music has a periodic and rich harmonic structure and high-frequency components; noise is relatively random and has no obvious structural characteristics. Therefore, the model not only needs the capability of extracting global context dependencies so as to capture long-time characteristics such as short-time stationarity and periodicity, but also needs to extract local context dependencies and the similarity between time-frequency units so as to capture characteristics such as harmonic structure and time-frequency distribution, so that voice, music, and noise can be better distinguished. The existing methods have shortcomings in extracting global dependence, local dependence, and the similarity between time-frequency units: the Complex-MTASSNet adopts a stacked convolution structure to extract separation features, so the global correlation features it captures are limited by the receptive field of the convolutional network; MRX adopts a bidirectional long short-term memory network to capture global dependence, which has high complexity and lacks the characteristics of local dependence and time-frequency unit similarity; the EAD-Conformer can capture global and local dependencies, but lacks the ability to capture the similarity between time-frequency units and is more complex.
According to the audio separation method and apparatus of the exemplary embodiments of the present disclosure, the most basic network structure of the application is a two-dimensional window self-attention network. The network is able to capture global context dependencies by using a multi-headed self-attention layer. By establishing the two-dimensional window self-attention layer, the two-dimensional first intermediate characteristics formed by the time-frequency units are converted into three-dimensional characteristics, namely, more detailed information is extracted for each time-frequency unit to upgrade the dimensions of the characteristics, so that the obtained three-dimensional characteristics contain richer local information compared with the two-dimensional first intermediate characteristics, and the similarity analysis among the time-frequency units is facilitated. Meanwhile, the two-dimensional window self-attention layer can provide a two-dimensional window for windowing the obtained three-dimensional features and can extract the features based on a self-attention mechanism, so that the local context dependency relationship can be captured. Therefore, the two-dimensional window self-attention network provided by the exemplary embodiment of the disclosure can comprehensively capture information required by multitask audio separation, and is helpful for improving separation performance. In addition, the exemplary embodiment of the disclosure further adopts a double-strategy framework of firstly performing rough separation and then performing fine adjustment, and the framework, after completing the initial audio separation, pre-estimates the residual error of the rough separation result by combining the data before and after the rough separation, and adjusts the rough separation result according to the residual error to obtain the final separation result, which is beneficial to improving the audio separation performance. By using the two-dimensional window self-attention network in both coarse separation and fine adjustment, the coarse separation capability and the residual estimation capability in fine adjustment can be further improved, and the audio separation performance can be further improved.
Hereinafter, an audio separating method and an audio separating apparatus according to an exemplary embodiment of the present disclosure will be described in detail with reference to fig. 1 to 8.
First, a two-dimensional windowed self-attention network is described.
Fig. 1 is a schematic diagram illustrating a structure of a two-dimensional window self-attention network according to an exemplary embodiment of the present disclosure. Referring to fig. 1, a two-dimensional Window Self-Attention network (WA-Transformer Block, window Attention-based Transformer Block) includes a multi-head Self-Attention layer (MSA) and a two-dimensional Window Self-Attention layer (2D-WA) in series, the multi-head Self-Attention layer and the two-dimensional Window Self-Attention layer are respectively used for extracting a first intermediate feature and a second intermediate feature, the first intermediate feature and the second intermediate feature are two-dimensional matrices composed of a plurality of time-frequency units, each time-frequency unit represents a feature of a pixel point on a spectrogram, the spectrogram is a graph capable of simultaneously displaying time-domain and frequency-domain information, and the two-dimensional matrix displays information of time-domain dimensions and frequency-domain dimensions. The two-dimensional window self-attention layer is used for extracting three-dimensional features from the first intermediate features and extracting second intermediate features from the three-dimensional features. Alternatively, referring to fig. 1, a full connection layer (FFN) is connected before the multi-head self-attention layer and after the two-dimensional window self-attention layer, and a residual connection and normalization process (Add & Norm, add and normalization) is performed after each layer, that is, the sum of the input features and the output features of the corresponding layer is calculated and normalized to be used as the input of the next layer.
As an example, the processing of a two-dimensional window self-attention network can be formulated as:
z = LN(x + 0.5 × FFN(x));
z′ = BN(z + MSA(z));
z″ = 2DWA(z′); output = LN(z″ + 0.5 × FFN(z″)).
wherein, FFN (), MSA (), 2DWA (), LN (), and BN () respectively represent the full connection layer, multi-headed attention layer, two-dimensional windowed self-attention layer, layer normalization, and batch normalization. For residual connection, in the above formula, the input and output feature weights of the full connection layer are 1 and 0.5, respectively, the input and output feature weights of the multi-head self-attention layer are 1, and the input and output feature weights of the two-dimensional window self-attention layer are 0 and 1, respectively.
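For illustration only, the following PyTorch-style sketch shows one way such a block could be wired; the module names, the feature layout (a flattened sequence of time-frequency units), the internal FFN width, and the exact placement of LN/BN are assumptions inferred from the formulas and residual weights above, not the patent's reference implementation.

```python
import torch
import torch.nn as nn

class WATransformerBlock(nn.Module):
    """Sketch of a WA-Transformer block: FFN -> MSA -> 2D-WA -> FFN,
    using the residual weights (1, 0.5), (1, 1), (0, 1) described above."""
    def __init__(self, dim, num_heads, two_d_wa):
        super().__init__()
        self.ffn_in = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
        self.ffn_out = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
        self.msa = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.two_d_wa = two_d_wa          # hypothetical 2D window self-attention layer (see FIG. 2)
        self.ln_in = nn.LayerNorm(dim)
        self.ln_out = nn.LayerNorm(dim)
        self.bn = nn.BatchNorm1d(dim)

    def forward(self, x):                 # x: (batch, T*F, dim), a sequence of time-frequency units
        z = self.ln_in(x + 0.5 * self.ffn_in(x))                    # z = LN(x + 0.5*FFN(x))
        attn, _ = self.msa(z, z, z)
        z = self.bn((z + attn).transpose(1, 2)).transpose(1, 2)     # z' = BN(z + MSA(z))
        z = self.two_d_wa(z)                                        # z'' = 2DWA(z'), no residual
        return self.ln_out(z + 0.5 * self.ffn_out(z))               # output = LN(z'' + 0.5*FFN(z''))
```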
Fig. 2 is a schematic view illustrating a structure of a two-dimensional window self-attention layer according to an exemplary embodiment of the present disclosure. Referring to FIG. 2, a two-dimensional window self-attention layer is used to perform the following steps for a first intermediate feature:
First, for each time-frequency unit x(t, f) in the first intermediate feature, a scale time-frequency unit extractor (TFBand) is used to extract a set number of time-frequency units spaced from the current time-frequency unit at a set scale in the first intermediate feature, as the feature vector of the current time-frequency unit, which can be formulated as:
x(t, f) = […, x(t + i×d_t, f + j×d_f), …];
i = 0, 1, 2, 3, ..., k_t − 1; j = 0, 1, 2, 3, ..., k_f − 1.
where t denotes the time-domain coordinate, k_t denotes the number of extractions in the time domain, and d_t denotes the extraction scale in the time domain, i.e., one time-frequency unit is extracted from every d_t time-frequency units; d_t = 1 represents continuous extraction without intervals, i.e., time-frequency units at the same scale as the input intermediate feature are extracted. All three are integers, and k_t and d_t are both greater than 0. Optionally, to extract rich information, time-frequency units at a scale different from that of the input intermediate feature may be extracted, i.e., d_t ≥ 2. Similarly, f denotes the frequency-domain coordinate, k_f denotes the number of extractions in the frequency domain, and d_f denotes the extraction scale in the frequency domain. In total, K = k_t × k_f time-frequency units can be extracted.
By way of example, referring to FIG. 2, for the first time-frequency unit, k_t = k_f = 3 and d_t = d_f = 2, so 9 time-frequency units spaced by 1 from the current unit in both the time domain and the frequency domain are extracted.
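As a concrete illustration of the indexing formula above, the short sketch below (a hypothetical helper, not taken from the patent) lists the K = k_t × k_f coordinates gathered for one time-frequency unit under the settings of the FIG. 2 example.

```python
def tfband_coords(t, f, k_t=3, k_f=3, d_t=2, d_f=2):
    """Coordinates extracted for time-frequency unit (t, f):
    x(t, f) = [..., x(t + i*d_t, f + j*d_f), ...], i < k_t, j < k_f."""
    return [(t + i * d_t, f + j * d_f) for i in range(k_t) for j in range(k_f)]

# For the first time-frequency unit (0, 0) with k_t = k_f = 3 and d_t = d_f = 2,
# 9 units spaced by 1 (i.e. every other unit) are gathered in both dimensions:
print(tfband_coords(0, 0))
# [(0, 0), (0, 2), (0, 4), (2, 0), (2, 2), (2, 4), (4, 0), (4, 2), (4, 4)]
```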
The feature vectors of all time-frequency units in the first intermediate feature are then merged to form the three-dimensional feature; that is, a two-dimensional matrix x ∈ R^(T×F) is processed into a tensor in R^(T×F×K). As an example, referring to fig. 2, T = F = 9, and the scale time-frequency unit extractor processes a two-dimensional square array into a three-dimensional cuboid array.
Then, windowing is performed on the three-dimensional features by using a Window divider (WP, window Partition) to obtain a plurality of three-dimensional feature blocks (Block). Referring to fig. 2, the three-dimensional features are evenly divided into 9 3 × 3 three-dimensional feature blocks.
Finally, two-dimensional features are extracted from each three-dimensional feature block based on a multi-head self-attention mechanism, and the corresponding two-dimensional features are combined according to the positions of the three-dimensional feature blocks in the three-dimensional feature to obtain the second intermediate feature. Specifically, referring to fig. 2, taking the third three-dimensional feature block in the second row as an example, the three-dimensional feature block may be stretched (Stretch) to convert the 3 × 3 three-dimensional feature block into a 9 × 1 three-dimensional feature block, and then a window-based multi-head self-attention layer (W-MSA) is used to perform the calculation to obtain another 9 × 1 three-dimensional feature block; this is a mature technique in the field and is not described in detail here. It should be understood that the window-based multi-head self-attention layer here differs from the multi-head self-attention layer in the two-dimensional window self-attention network only in the dimensions of the input features, and the features are processed in a similar manner. The three 1 × 9 three-dimensional feature blocks following the 9 × 1 three-dimensional feature block in fig. 2 represent Q (query), K (key), and V (value) in the multi-head self-attention mechanism, respectively. Then, processing through a full connection layer yields a 9 × 1 two-dimensional feature, which is reconverted into a 3 × 3 two-dimensional feature; when the two-dimensional features are combined, this feature is placed at the position of the second row and the third column, and finally a 9 × 9 two-dimensional feature is obtained as the output intermediate feature.
In general, the processing of a two-dimensional window self-attention layer can be formulated as:
z_3D = TFBand(z′);
{B_1, …, B_M} = WP(z_3D);
z″ = Merge(FFN(W-MSA(LN(B_1))), …, FFN(W-MSA(LN(B_M))));
where LN() represents normalization by layer normalization, TFBand() represents the scale time-frequency unit extractor, WP() represents the window divider, W-MSA() represents the window-based multi-head self-attention layer, and Merge() represents combining the per-block outputs according to their positions in the three-dimensional feature.
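To make the TFBand → window partition → W-MSA → merge pipeline concrete, here is a rough PyTorch sketch; the padding strategy at the feature borders, the window size, and the placeholder used when no W-MSA module is supplied are assumptions for illustration, not the patent's exact layer.

```python
import torch
import torch.nn.functional as F

def two_d_window_attention(x, k_t=3, k_f=3, d_t=2, d_f=2, win=3, w_msa=None):
    """Sketch of the 2D window self-attention layer on a (T, F) feature map x.
    1) TFBand: gather K = k_t*k_f dilated neighbours per unit -> (T, F, K) tensor.
    2) WP: split the (T, F, K) tensor into win x win blocks.
    3) W-MSA: attend inside each block (w_msa is a hypothetical module that
       returns one value per time-frequency unit of the block).
    4) Merge the per-block outputs back into a (T, F) map."""
    T, Fdim = x.shape
    pad_t, pad_f = (k_t - 1) * d_t, (k_f - 1) * d_f
    xp = F.pad(x.unsqueeze(0).unsqueeze(0), (0, pad_f, 0, pad_t)).squeeze()
    feats = torch.stack([xp[i * d_t:i * d_t + T, j * d_f:j * d_f + Fdim]
                         for i in range(k_t) for j in range(k_f)], dim=-1)    # (T, F, K)
    blocks = feats.reshape(T // win, win, Fdim // win, win, -1).permute(0, 2, 1, 3, 4)
    blocks = blocks.reshape(-1, win * win, feats.shape[-1])                   # (blocks, win*win, K)
    out = w_msa(blocks) if w_msa is not None else blocks.mean(-1, keepdim=True)
    out = out.reshape(T // win, Fdim // win, win, win).permute(0, 2, 1, 3)    # place blocks back
    return out.reshape(T, Fdim)
```

Under the FIG. 2 settings (T = F = 9, win = 3), this produces nine 3 × 3 blocks and a 9 × 9 output map.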
The following describes the execution flow of the dual strategy framework, i.e. the entire audio separation model.
Fig. 3 is a schematic structural diagram illustrating an audio separation model according to an exemplary embodiment of the present disclosure. Referring to fig. 3, the audio Separation model is executed mainly by a Separation stage (Separation stage) using a Separation network (Separator) and a Residual compensation stage (Residual compensation stage) using a Residual compensation network (Residual compensation), both of which include the two-dimensional window self-attention network.
Fig. 4 is a flowchart illustrating an audio separation method according to an exemplary embodiment of the present disclosure. It should be understood that the audio separation method according to the exemplary embodiment of the present disclosure may be implemented in a terminal device such as a smartphone, a tablet computer, a Personal Computer (PC), or may be implemented in a device such as a server.
Referring to fig. 4, in step 401, a mixed audio complex spectrum of audio to be separated and coarse audio complex spectra of at least two tracks are obtained for the audio to be separated based on the coarse separation network of an audio separation model. That is, not only the coarse audio complex spectrum of each track but also the mixed audio complex spectrum of the whole audio is obtained. It should be understood that the exemplary embodiments of the present disclosure have the ability to separate speech, music, and background noise, but, limited by the actual content of the audio to be separated, may only separate out the data of two tracks, which is why "at least two" tracks is used here.
Fig. 5 is a schematic diagram illustrating a structure of a coarse separation network according to an exemplary embodiment of the present disclosure. Referring to fig. 5, optionally, the coarse separation network includes a plurality of serial two-dimensional window self-attention networks; the structure is simple, and a more accurate coarse audio complex spectrum can be obtained by repeatedly extracting intermediate features, which helps to balance structural complexity and audio separation effect. As an example, holes (dilations) may be injected into each two-dimensional window self-attention network to increase the receptive field, with a dilation rate of, for example, 2^(n−1), which denotes injecting 2^(n−1) − 1 holes between two adjacent convolution kernels, where n represents the sequence number of the two-dimensional window self-attention network in the coarse separation network.
Optionally, referring to fig. 3, step 401 includes the following steps:
First, a mixed audio magnitude spectrum Y_mag-mix and the mixed audio complex spectrum Y_RI-mix are separated from the audio to be separated, which may specifically be implemented by the Short-Time Fourier Transform (STFT). The mixed audio complex spectrum contains the mixed audio phase Y_phase-mix = arctan(Y_I-mix / Y_R-mix), where Y_R-mix and Y_I-mix respectively represent the real part and the imaginary part of the mixed audio complex spectrum.
Then, the mixed audio magnitude spectrum Y_mag-mix is input into the coarse separation network to obtain the coarse audio magnitude spectra of the at least two tracks. Specifically, the coarse separation network repeatedly extracts intermediate features, and then converts the finally obtained intermediate features into the coarse audio magnitude spectra of the at least two tracks. As an example, the coarse audio magnitude spectrum Ŷ_mag-speech of the voice track, the coarse audio magnitude spectrum Ŷ_mag-music of the music track, and the coarse audio magnitude spectrum Ŷ_mag-noise of the background noise track may be determined.
The mixed audio magnitude spectrum, which is a real-valued spectrum, is input into the coarse separation network; the magnitude spectrum is separated first and then converted into the complex spectrum, rather than separating the complex spectrum directly, which can reduce the calculation error and improve the coarse separation accuracy.
Finally, for each track, the coarse audio complex spectrum is determined according to the coarse audio magnitude spectrum and the mixed audio phase Y_phase-mix. As an example, the coarse audio complex spectra of the voice track (Ŷ_RI-speech), the music track (Ŷ_RI-music), and the background noise track (Ŷ_RI-noise) can be determined. Specifically, taking Ŷ_RI-speech as an example, the real part is Ŷ_R-speech = Ŷ_mag-speech × cos(Y_phase-mix) and the imaginary part is Ŷ_I-speech = Ŷ_mag-speech × sin(Y_phase-mix); the other tracks are treated in the same way.
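For orientation, a minimal numpy sketch of step 401 under the above definitions follows; the `coarse_separation_net` callable, the returned track dictionary, and the STFT parameters are assumptions, and scipy's STFT is used purely for illustration.

```python
import numpy as np
from scipy.signal import stft

def coarse_separate(mix, coarse_separation_net, fs=16000, nperseg=512):
    """Step 401 sketch: STFT -> magnitude separation -> reuse the mixture phase."""
    _, _, Y_mix = stft(mix, fs=fs, nperseg=nperseg)       # mixed audio complex spectrum Y_RI-mix
    Y_mag_mix = np.abs(Y_mix)                              # mixed audio magnitude spectrum Y_mag-mix
    Y_phase_mix = np.angle(Y_mix)                          # mixed audio phase Y_phase-mix
    # hypothetical coarse separation network: magnitude spectra for speech / music / noise
    coarse_mags = coarse_separation_net(Y_mag_mix)         # e.g. {"speech": ..., "music": ..., "noise": ...}
    coarse_complex = {
        track: mag * np.cos(Y_phase_mix) + 1j * mag * np.sin(Y_phase_mix)
        for track, mag in coarse_mags.items()
    }                                                      # coarse audio complex spectra per track
    return Y_mix, coarse_complex
```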
Referring back to fig. 4, in step 402, for the coarse audio complex spectra and the mixed audio complex spectrum of the at least two tracks, the complex spectral residuals of the at least two tracks are obtained based on the residual compensation network of the audio separation model. As an example, the complex spectral residual Res_speech of the voice track, the complex spectral residual Res_music of the music track, and the complex spectral residual Res_noise of the background noise track may be obtained. By combining the complex spectrum data before and after coarse separation with the residual compensation network containing the two-dimensional window self-attention network, the residual of the coarse separation result can be accurately estimated and used as compensation for the coarse separation result, so that the coarse separation result can be adjusted according to the residual in subsequent steps to obtain the final separation result, which helps to improve audio separation performance.
Fig. 6 is a schematic diagram illustrating a structure of a residual compensation network according to an exemplary embodiment of the present disclosure. Referring to fig. 6, the residual compensation network optionally includes an encoding network (Encoder), two-dimensional window self-attention networks (Stacked WA-Transformer Blocks, i.e., a plurality of the two-dimensional window self-attention networks connected in series), and a decoding network (Decoder), where reshape is an operation on a feature tensor used to reconstruct its shape parameters, the shape parameters representing the length of the feature tensor in each dimension, that is, the total number of elements in each dimension. As an example, the residual compensation network is obtained by replacing the Long Short-Term Memory network (LSTM) in a Gated Convolutional Recurrent Network (GCRN) with the two-dimensional window self-attention network; the encoding network includes a plurality of serial convolutional gated linear units (ConvGLU), and the decoding network is divided into two branches for processing the real part and the imaginary part, each branch including a plurality of serial deconvolutional gated linear units (DeconvGLU). The single-encoding dual-decoding structure can improve the estimation accuracy of the real part and the imaginary part, the convolution units with a gating mechanism can further improve the modeling capability of the network, and combining the two-dimensional window self-attention network ensures that the complex spectral residual can be estimated accurately.
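The gated convolution units mentioned above might look roughly like the PyTorch sketch below; the kernel sizes, strides, and channel counts are illustrative assumptions rather than the patent's configuration. In the residual compensation network, several ConvGLU units would form the encoder, the stacked WA-Transformer blocks would sit in the middle, and two parallel chains of DeconvGLU units would decode the real and imaginary parts.

```python
import torch
import torch.nn as nn

class ConvGLU(nn.Module):
    """Convolutional gated linear unit: a conv branch gated by a sigmoid conv branch."""
    def __init__(self, c_in, c_out, kernel=(3, 3), stride=(1, 2), padding=(1, 1)):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, kernel, stride, padding)
        self.gate = nn.Conv2d(c_in, c_out, kernel, stride, padding)

    def forward(self, x):
        return self.conv(x) * torch.sigmoid(self.gate(x))

class DeconvGLU(nn.Module):
    """Deconvolutional gated linear unit used in the real/imaginary decoding branches."""
    def __init__(self, c_in, c_out, kernel=(3, 3), stride=(1, 2), padding=(1, 1)):
        super().__init__()
        self.deconv = nn.ConvTranspose2d(c_in, c_out, kernel, stride, padding)
        self.gate = nn.ConvTranspose2d(c_in, c_out, kernel, stride, padding)

    def forward(self, x):
        return self.deconv(x) * torch.sigmoid(self.gate(x))
```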
Optionally, step 402 comprises: for each track, determining a difference value between the mixed audio complex spectrum and the coarse-divided audio complex spectrum as an initial residual error; and inputting the initial residual error of the at least two tracks and the mixed audio complex spectrum into a residual error compensation network to obtain the complex spectrum residual error of the at least two tracks. Specifically, similar to the rough separation network, the residual error compensation network needs to extract the intermediate features repeatedly, and then convert the finally obtained intermediate features into complex spectral residual errors of at least two tracks. As an example, the initial residual can be formulated as:
InRes_i = Y_RI-mix − Ŷ_RI-i, i ∈ {speech, music, noise}.
that is, the difference between the mixed audio complex spectrum and the coarse audio complex spectrum of each track is used as the initial residual of the corresponding track.
Accordingly, the input to the residual compensation network can be formulated as:
InRes_1,2,3 = [InRes_speech, InRes_music, InRes_noise, Y_RI-mix].
accordingly, the output complex spectral residual can be formulated as:
OutRes_R-1,2,3, OutRes_I-1,2,3 = RCN(InRes_1,2,3).
where OutRes_R-1,2,3 includes OutRes_R-speech, OutRes_R-music, and OutRes_R-noise, which respectively represent the real parts of the complex spectral residuals of voice, music, and background noise; OutRes_I-1,2,3 includes OutRes_I-speech, OutRes_I-music, and OutRes_I-noise, which respectively represent the imaginary parts of the complex spectral residuals of voice, music, and background noise; and RCN() represents the residual compensation network.
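A small sketch of how the residual compensation inputs can be assembled is given below; the stacking axis, the track ordering, and the `rcn` callable are assumptions for illustration.

```python
import numpy as np

def residual_compensation(Y_mix, coarse_complex, rcn):
    """Step 402 sketch: initial residuals plus the mixture spectrum feed the RCN."""
    tracks = ["speech", "music", "noise"]
    in_res = {t: Y_mix - coarse_complex[t] for t in tracks}   # InRes_i = Y_RI-mix - coarse spectrum
    # InRes_1,2,3 = [InRes_speech, InRes_music, InRes_noise, Y_RI-mix]
    rcn_input = np.stack([in_res["speech"], in_res["music"], in_res["noise"], Y_mix], axis=0)
    out_res_real, out_res_imag = rcn(rcn_input)               # hypothetical residual compensation net
    return {t: out_res_real[k] + 1j * out_res_imag[k] for k, t in enumerate(tracks)}
```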
Referring back to fig. 4, at step 403, for each track, an audio complex spectrum is determined from the coarsely divided audio complex spectrum and the complex spectral residuals. Referring to fig. 3, as an example, the two may be summed, as an audio complex spectrum for each track, which may be formulated as:
Ŷ′_RI-i = Ŷ_RI-i + Res_i, i ∈ {speech, music, noise}.
In step 404, the audio complex spectra of the at least two tracks are respectively converted into audio signals. As an example, this may be achieved by the inverse short-time Fourier transform.
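Steps 403 and 404 then reduce to an element-wise sum and an inverse transform, roughly as sketched below; scipy's istft is used for illustration and must match the analysis parameters used for the forward STFT.

```python
from scipy.signal import istft

def compensate_and_synthesize(coarse_complex, residuals, fs=16000, nperseg=512):
    """Step 403/404 sketch: add the complex spectral residual, then convert to waveforms."""
    audio = {}
    for track, coarse in coarse_complex.items():
        Y_final = coarse + residuals[track]        # audio complex spectrum = coarse spectrum + residual
        _, signal = istft(Y_final, fs=fs, nperseg=nperseg)
        audio[track] = signal
    return audio
```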
Optionally, the audio separation model is trained by the following steps: acquiring a sample audio, wherein the sample audio is formed by overlapping pure audio signals of at least two tracks, and a pure audio complex spectrum of each pure audio signal is acquired; for sample audio, obtaining estimated audio complex spectrums and estimated audio signals of at least two tracks of the sample audio based on an audio separation model to be trained; determining a loss value according to the pure audio complex spectrums, the pre-estimated audio complex spectrums, the pure audio signals and the pre-estimated audio signals of at least two tracks; and adjusting parameters of the audio separation model to be trained based on the loss value to obtain the audio separation model. The audio separation model can be trained in a supervision mode, and the reliability of a training result is guaranteed. The loss value is determined by combining the complex spectrum and the audio signal, the separation effect of the audio separation model to be trained can be comprehensively evaluated, and the training efficiency is further improved.
Optionally, determining the loss value according to the pure audio complex spectrum, the estimated audio complex spectrum, the pure audio signal, and the estimated audio signal of the at least two tracks includes: determining a first loss value according to the pure audio complex spectrum and the pre-estimated audio complex spectrum of each track; determining a second loss value according to the pure audio complex spectrum of each track and the estimated audio complex spectra of other tracks except the current track in the at least two tracks; determining a third loss value according to the pure audio signal and the pre-estimated audio signal of each track; and determining a total loss value according to the first loss value, the second loss value and the third loss value, wherein the total loss value is positively correlated with the first loss value, and the total loss value is negatively correlated with the second loss value and the third loss value. The first loss value and the third loss value can directly refer to pure audio data, the separation effect of the audio separation model to be trained is evaluated from the angles of the complex spectrum and the audio signal, in addition, the second loss value also compares the pure audio complex spectrum of the current track with the pre-estimated audio complex spectrum of other tracks, the discrimination capability of the model can be enhanced, and the training effect can be further promoted.
As an example, the first loss value is an average absolute error, which can be expressed as:
L_MAE = (1/N) × Σ_{i=1}^{N} |S_i − S′_i|.
the second loss value is a discrimination loss, which can be formulated as:
L_D = (1/(N×(N−1))) × Σ_{i=1}^{N} Σ_{j≠i} |S_i − S′_j|.
the third loss value is the Signal-to-Noise Ratio (SNR), which can be formulated as:
SNR = (1/N) × Σ_{i=1}^{N} 10 log10( ||s_i||² / ||s_i − s′_i||² ).
the loss value can be formulated as:
Loss = L_MAE − λ × (t/T) × L_D − α × SNR.
the number of extracted tracks of N is, for example, 3.S, S' represents the clean audio complex spectrum and the estimated audio complex spectrum, respectively. s, s' represent the clean audio signal and the estimated audio signal, respectively. λ, α are hyper-parameters that balance each loss term in the loss values, making these loss terms numerically on the same scale. T represents the maximum number of training times, and T represents the number of iterations of the current training.
To verify the performance of the audio separation model of the exemplary embodiment of the present disclosure, the amount of computation per second (MAC/s), the time required to process one second of audio (real-time rate) on a GPU (Nvidia 2080 Ti) of a standard system, and the size of the model parameters (occupied space) were counted. With a parameter count of 8.62 M, the audio separation model of the exemplary embodiment of the present disclosure outperforms other models, with the best real-time rate and a small amount of computation. With a parameter count comparable to that of the EAD-Conformer, its performance is significantly better than that of the other models. In summary, the present disclosure achieves the best signal-to-noise ratio improvement on all three tracks: 13.86 dB, 12.22 dB, and 11.21 dB on the voice, music, and noise tracks, respectively. This illustrates the effectiveness and advancement of the present disclosure.
Fig. 7 is a block diagram illustrating an audio separating apparatus according to an exemplary embodiment of the present disclosure. It should be understood that the audio separation apparatus according to the exemplary embodiments of the present disclosure may be implemented in a terminal device such as a smart phone, a tablet computer, a Personal Computer (PC) in a software, hardware, or a combination of software and hardware, and may also be implemented in a device such as a server.
Referring to fig. 7, the audio separating apparatus includes a rough separation unit 701, a residual unit 702, a compensation unit 703, and a conversion unit 704.
The rough separation unit 701 may obtain a mixed audio complex spectrum of the audio to be separated and rough separation audio complex spectra of at least two tracks based on a rough separation network of an audio separation model.
Optionally, the coarse separation network comprises a plurality of serial two-dimensional window self-attention networks.
Optionally, the rough separation unit 701 may further separate a mixed audio magnitude spectrum and a mixed audio complex spectrum from the audio to be separated, where the mixed audio complex spectrum includes a mixed audio phase; inputting the mixed audio amplitude spectrum into a rough separation network to obtain rough separation audio amplitude spectrums of at least two tracks; and for each track, determining a coarse audio complex spectrum according to the coarse audio amplitude spectrum and the mixed audio phase.
The residual unit 702 may obtain the complex spectral residuals of the at least two tracks based on a residual compensation network of the audio separation model for the coarse audio complex spectrum and the mixed audio complex spectrum of the at least two tracks.
Optionally, the residual error compensation network comprises an encoding network, a two-dimensional window self-attention network, and a decoding network.
Optionally, the residual unit 702 may further determine, for each track, a difference value between the mixed audio complex spectrum and the coarse-divided audio complex spectrum as an initial residual; and inputting the initial residual error of the at least two tracks and the mixed audio complex spectrum into a residual error compensation network to obtain the complex spectrum residual error of the at least two tracks.
The compensation unit 703 may determine an audio complex spectrum from the coarsely divided audio complex spectrum and the complex spectrum residual for each track.
The converting unit 704 may convert the audio complex spectra of at least two tracks into audio signals, respectively.
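Continuing the sketch above, the compensation and conversion steps could then amount to adding the estimated residual to the coarse complex spectrum of each track and applying an inverse STFT; the parameters shown are assumptions of the sketch.

```python
def reconstruct_tracks(coarse_complex, residuals, n_fft=1024, hop=256, length=None):
    """Sketch: add the estimated residual to the coarse complex spectrum of
    each track and convert the result back to an audio signal via iSTFT."""
    window = torch.hann_window(n_fft)
    signals = []
    for coarse, res in zip(coarse_complex, residuals):
        audio_complex = coarse + res   # compensated audio complex spectrum
        signals.append(torch.istft(audio_complex, n_fft, hop_length=hop,
                                   window=window, length=length))
    return signals
```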
Both the rough separation network and the residual error compensation network comprise a two-dimensional window self-attention network. The two-dimensional window self-attention network comprises a multi-head self-attention layer and a two-dimensional window self-attention layer connected in series, which are used for extracting a first intermediate feature and a second intermediate feature, respectively. The first intermediate feature and the second intermediate feature are both two-dimensional matrices formed by a plurality of time-frequency units, and the two-dimensional window self-attention layer is used for extracting a three-dimensional feature from the first intermediate feature and extracting the second intermediate feature from the three-dimensional feature.
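As a non-authoritative sketch, the serial structure described above could be organized as follows. Treating the frequency bins of each frame as the attention embedding dimension is an assumption of this sketch, and `TwoDWindowSelfAttentionLayer` is a hypothetical layer sketched after the next paragraph.

```python
import torch.nn as nn

class TwoDWindowSelfAttentionNetwork(nn.Module):
    """Sketch of the serial structure: a multi-head self-attention layer
    followed by a two-dimensional window self-attention layer. Treating the
    frequency bins of each frame as the attention embedding dimension is an
    assumption of this sketch (freq_bins must be divisible by num_heads)."""
    def __init__(self, freq_bins, num_heads=4, window=(4, 4)):
        super().__init__()
        self.global_attn = nn.MultiheadAttention(freq_bins, num_heads, batch_first=True)
        self.window_attn = TwoDWindowSelfAttentionLayer(num_heads=num_heads, window=window)

    def forward(self, x):                      # x: (batch, T, F) map of time-frequency units
        first, _ = self.global_attn(x, x, x)   # first intermediate feature (global context)
        return self.window_attn(first)         # second intermediate feature (local windows)
```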
Optionally, the two-dimensional window self-attention layer is configured to perform the following steps on the first intermediate feature: for each time-frequency unit in the first intermediate feature, extracting a set number of time-frequency units spaced from the current time-frequency unit at set-scale intervals as the feature vector of the current time-frequency unit, and aggregating the feature vectors of all the time-frequency units in the first intermediate feature to form a three-dimensional feature; performing windowing processing on the three-dimensional feature to obtain a plurality of three-dimensional feature blocks; and extracting a two-dimensional feature from each three-dimensional feature block based on a multi-head self-attention mechanism, and combining the corresponding two-dimensional features according to the positions of the three-dimensional feature blocks in the three-dimensional feature to obtain the second intermediate feature.
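One possible reading of these three steps is sketched below. The neighbour count, dilation, window size, and linear projections are hypothetical choices of this sketch, and T and F are assumed divisible by the window size.

```python
import torch
import torch.nn as nn

class TwoDWindowSelfAttentionLayer(nn.Module):
    """Sketch of the two-dimensional window self-attention layer. The
    neighbour count, dilation, window size, and linear projections are
    hypothetical; T and F are assumed divisible by the window size."""
    def __init__(self, num_taps=8, dilation=2, dim=32, num_heads=4, window=(4, 4)):
        super().__init__()
        self.num_taps, self.dilation, self.window = num_taps, dilation, window
        self.embed = nn.Linear(2 * num_taps + 1, dim)   # gathered neighbours -> attention width
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.to_scalar = nn.Linear(dim, 1)              # 3D block -> 2D feature

    def forward(self, x):                               # x: (batch, T, F) first intermediate feature
        b, t, f = x.shape
        # Step 1: for every time-frequency unit, gather a set number of units at
        # set-scale (dilated) intervals along time and frequency -> 3D feature.
        feats = [x]
        for k in range(1, self.num_taps + 1):
            feats.append(torch.roll(x, shifts=k * self.dilation, dims=1))  # along time
            feats.append(torch.roll(x, shifts=k * self.dilation, dims=2))  # along frequency
        three_d = self.embed(torch.stack(feats, dim=-1))                   # (b, T, F, dim)

        # Step 2: window the 3D feature into (wT, wF) blocks.
        wt, wf = self.window
        blocks = three_d.reshape(b, t // wt, wt, f // wf, wf, -1)
        blocks = blocks.permute(0, 1, 3, 2, 4, 5).reshape(-1, wt * wf, blocks.shape[-1])

        # Step 3: multi-head self-attention inside each block, reduce to a 2D
        # feature, and recombine blocks at their original positions.
        attended, _ = self.attn(blocks, blocks, blocks)
        two_d = self.to_scalar(attended).reshape(b, t // wt, f // wf, wt, wf)
        two_d = two_d.permute(0, 1, 3, 2, 4).reshape(b, t, f)              # second intermediate feature
        return two_d
```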
Optionally, the audio separation model is trained by the following steps: acquiring a sample audio, wherein the sample audio is formed by overlapping pure audio signals of at least two tracks, and a pure audio complex spectrum of each pure audio signal is acquired; for sample audio, obtaining estimated audio complex spectrums and estimated audio signals of at least two tracks of the sample audio based on an audio separation model to be trained; determining a loss value according to the pure audio complex spectrums, the pre-estimated audio complex spectrums, the pure audio signals and the pre-estimated audio signals of at least two tracks; and adjusting parameters of the audio separation model to be trained based on the loss value to obtain the audio separation model.
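A minimal training-loop sketch consistent with these steps might look as follows; the `model`, `dataset`, and `optimizer` objects are hypothetical, and `separation_loss` is sketched after the next paragraph.

```python
def train(model, dataset, optimizer, epochs=10):
    """Sketch of the training steps: each sample audio is a superposition of
    clean per-track signals; the model to be trained predicts per-track
    complex spectra and signals, which the loss compares with the clean
    references. All objects here are hypothetical."""
    for _ in range(epochs):
        for sample_audio, clean_complex, clean_signals in dataset:
            est_complex, est_signals = model(sample_audio)
            loss = separation_loss(clean_complex, est_complex,
                                   clean_signals, est_signals)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```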
Optionally, determining the loss value according to the pure audio complex spectrum, the estimated audio complex spectrum, the pure audio signal, and the estimated audio signal of the at least two tracks includes: determining a first loss value according to the pure audio complex spectrum and the pre-estimated audio complex spectrum of each track; determining a second loss value according to the pure audio complex spectrum of each track and the estimated audio complex spectra of other tracks except the current track in the at least two tracks; determining a third loss value according to the pure audio signal and the pre-estimated audio signal of each track; and determining a total loss value according to the first loss value, the second loss value and the third loss value, wherein the total loss value is positively correlated with the first loss value, and the total loss value is negatively correlated with the second loss value and the third loss value.
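A sketch of such a loss is given below. The choice of mean squared error for the spectral terms, of scale-invariant SNR (which grows as the estimate improves, hence the negative sign) for the signal term, and of the weights `alpha` and `beta` are assumptions of this sketch rather than values stated by the present disclosure.

```python
import torch
import torch.nn.functional as F

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant SNR in dB; larger values mean a better estimate."""
    ref_energy = (ref * ref).sum(-1, keepdim=True) + eps
    proj = ((est * ref).sum(-1, keepdim=True) / ref_energy) * ref
    noise = est - proj
    ratio = (proj * proj).sum(-1) / ((noise * noise).sum(-1) + eps)
    return 10 * torch.log10(ratio + eps)

def separation_loss(clean_complex, est_complex, clean_sig, est_sig,
                    alpha=0.1, beta=0.05):
    """Sketch of the three-term loss; alpha and beta are hypothetical weights."""
    n = len(clean_complex)
    # First loss: each track's estimated complex spectrum vs. its clean spectrum.
    first = sum(F.mse_loss(torch.view_as_real(est_complex[i]),
                           torch.view_as_real(clean_complex[i])) for i in range(n))
    # Second loss: a track's clean spectrum vs. the other tracks' estimates;
    # the total loss decreases as this distance grows (tracks pushed apart).
    second = sum(F.mse_loss(torch.view_as_real(est_complex[j]),
                            torch.view_as_real(clean_complex[i]))
                 for i in range(n) for j in range(n) if j != i)
    # Third loss: signal-domain similarity (SI-SNR) between each track's clean
    # and estimated signals; larger is better, so it also enters negatively.
    third = sum(si_snr(est_sig[i], clean_sig[i]).mean() for i in range(n))
    return first - alpha * second - beta * third
```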
With regard to the apparatus in the above-described embodiment, the specific manner in which each unit performs the operation has been described in detail in the embodiment related to the method, and will not be described in detail here.
Fig. 8 is a block diagram of an electronic device according to an example embodiment of the present disclosure.
Referring to fig. 8, an electronic device 800 includes at least one memory 801 and at least one processor 802, the at least one memory 801 having stored therein a set of computer-executable instructions that, when executed by the at least one processor 802, perform an audio separation method according to an exemplary embodiment of the present disclosure.
By way of example, the electronic device 800 may be a personal computer (PC), a tablet device, a personal digital assistant, a smart phone, or another device capable of executing the above set of instructions. Here, the electronic device 800 need not be a single electronic device, but can be any collection of devices or circuits that can execute the above instructions (or sets of instructions) either individually or in combination. The electronic device 800 may also be part of an integrated control system or system manager, or may be configured as a portable electronic device that interfaces locally or remotely (e.g., via wireless transmission).
In the electronic device 800, the processor 802 may include a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a programmable logic device, a special purpose processor system, a microcontroller, or a microprocessor. By way of example, and not limitation, processors may also include analog processors, digital processors, microprocessors, multi-core processors, processor arrays, network processors, and the like.
The processor 802 may execute instructions or code stored in the memory 801, wherein the memory 801 may also store data. The instructions and data may also be transmitted or received over a network via a network interface device, which may employ any known transmission protocol.
The memory 801 may be integrated with the processor 802, for example, with RAM or flash memory disposed within an integrated circuit microprocessor or the like. Further, memory 801 may include a stand-alone device, such as an external disk drive, storage array, or any other storage device usable by a database system. The memory 801 and the processor 802 may be operatively coupled or may communicate with each other, such as through I/O ports, network connections, etc., so that the processor 802 can read files stored in the memory.
Further, the electronic device 800 may also include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, mouse, touch input device, etc.). All components of the electronic device 800 may be connected to each other via a bus and/or a network.
According to an exemplary embodiment of the present disclosure, there may also be provided a computer-readable storage medium in which instructions, when executed by at least one processor, cause the at least one processor to perform an audio separation method according to an exemplary embodiment of the present disclosure. Examples of the computer-readable storage medium herein include: read-only memory (ROM), programmable read-only memory (PROM), electrically erasable programmable read-only memory (EEPROM), random-access memory (RAM), dynamic random-access memory (DRAM), static random-access memory (SRAM), flash memory, non-volatile memory, CD-ROM, CD-R, CD+R, CD-RW, CD+RW, DVD-ROM, DVD-R, DVD+R, DVD-RW, DVD+RW, DVD-RAM, BD-ROM, BD-R, BD-R LTH, BD-RE, Blu-ray or optical disc storage, hard disk drive (HDD), solid-state drive (SSD), card storage (such as a multimedia card, a Secure Digital (SD) card, or an extreme digital (XD) card), magnetic tape, a floppy disk, a magneto-optical data storage device, an optical data storage device, a hard disk, a solid-state disk, and any other device configured to store a computer program and any associated data, data files, and data structures in a non-transitory manner and to provide the computer program and any associated data, data files, and data structures to a processor or a computer so that the processor or the computer can execute the computer program. The computer program in the computer-readable storage medium described above can be run in an environment deployed in a computer apparatus such as a client, a host, a proxy device, a server, and the like. Further, in one example, the computer program and any associated data, data files, and data structures are distributed across a networked computer system such that they are stored, accessed, and executed in a distributed fashion by one or more processors or computers.
According to an exemplary embodiment of the present disclosure, a computer program product may also be provided, the computer program product comprising computer instructions which, when executed by at least one processor, cause the at least one processor to perform an audio separation method according to an exemplary embodiment of the present disclosure.
According to the audio separation method, apparatus, electronic device, and computer-readable storage medium of the exemplary embodiments of the present disclosure, the most basic network structure applied is the two-dimensional window self-attention network. This network captures global context dependencies by using a multi-head self-attention layer. By establishing the two-dimensional window self-attention layer, the two-dimensional first intermediate feature formed by time-frequency units is converted into a three-dimensional feature; that is, more detailed information is extracted for each time-frequency unit to increase the dimension of the feature, so that the obtained three-dimensional feature contains richer local information than the two-dimensional first intermediate feature, which facilitates the similarity analysis among time-frequency units. Meanwhile, the two-dimensional window self-attention layer provides a two-dimensional window for windowing the obtained three-dimensional feature and extracts features based on a self-attention mechanism, so that the information required by multi-task audio separation can be comprehensively captured, which helps to improve the separation performance. In addition, the exemplary embodiment of the present disclosure further adopts a dual-strategy framework of coarse separation and fine adjustment: after the preliminary audio separation is completed, the framework estimates the residual of the coarse separation result by combining the data before and after the coarse separation, and adjusts the coarse separation result according to the residual to obtain the final separation result, which helps to improve the audio separation performance. By using the two-dimensional window self-attention network in both coarse separation and fine adjustment, the coarse separation capability and the residual estimation capability in fine adjustment can be further improved, and the audio separation performance can be further improved.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (9)

1. An audio separation method, characterized in that the audio separation method comprises:
the method comprises the steps that a mixed audio complex spectrum of audio to be separated and coarse-divided audio complex spectra of at least two tracks are obtained on the basis of a coarse-divided network of an audio separation model;
obtaining the complex spectrum residuals of the at least two tracks based on a residual compensation network of the audio separation model for the coarse audio complex spectrum and the mixed audio complex spectrum of the at least two tracks;
for each track, determining an audio complex spectrum according to the coarse audio complex spectrum and the complex spectrum residual error;
respectively converting the audio complex spectrums of the at least two tracks into audio signals;
the rough separation network and the residual error compensation network both comprise a two-dimensional window self-attention network, the two-dimensional window self-attention network comprises a multi-head self-attention layer and a two-dimensional window self-attention layer which are connected in series, the multi-head self-attention layer and the two-dimensional window self-attention layer are respectively used for extracting a first intermediate feature and a second intermediate feature, the first intermediate feature and the second intermediate feature are both two-dimensional matrixes formed by a plurality of time-frequency units, and the two-dimensional window self-attention layer is used for extracting a three-dimensional feature from the first intermediate feature and extracting the second intermediate feature from the three-dimensional feature.
2. The audio separation method of claim 1, wherein the two-dimensional window self attention layer is for performing the following steps on the first intermediate feature:
for each time-frequency unit in the first intermediate characteristic, extracting a set number of time-frequency units with a set scale spaced from the current time-frequency unit in the first intermediate characteristic to serve as characteristic vectors of the current time-frequency unit, and converging the characteristic vectors of the time-frequency units in the first intermediate characteristic to form the three-dimensional characteristic;
performing windowing processing on the three-dimensional features to obtain a plurality of three-dimensional feature blocks;
and extracting two-dimensional features from each three-dimensional feature block based on a multi-head self-attention mechanism, and combining the corresponding two-dimensional features according to the positions of the three-dimensional feature blocks in the three-dimensional features to obtain the second intermediate features.
3. The audio separation method of claim 1, wherein the obtaining of the coarse audio complex spectrums and the mixed audio complex spectrums of the at least two tracks of the audio to be separated based on a coarse network of an audio separation model comprises:
separating a mixed audio amplitude spectrum and the mixed audio complex spectrum from the audio to be separated, wherein the mixed audio complex spectrum comprises a mixed audio phase;
inputting the mixed audio amplitude spectrum into the rough separation network to obtain rough separation audio amplitude spectrums of the at least two tracks;
and for each track, determining the coarse audio complex spectrum according to the coarse audio magnitude spectrum and the mixed audio phase.
4. The audio separation method of claim 1, wherein the obtaining the complex spectral residuals of the at least two tracks based on a residual compensation network of the audio separation model for the coarsely divided audio complex spectrum and the mixed audio complex spectrum of the at least two tracks comprises:
for each track, determining a difference value between the mixed audio complex spectrum and the coarse-divided audio complex spectrum as an initial residual error;
inputting the initial residuals of the at least two tracks and the mixed audio complex spectrum into the residual compensation network to obtain the complex spectrum residuals of the at least two tracks.
5. The audio separation method of any one of claims 1 to 4, wherein the audio separation model is trained by:
acquiring sample audio, wherein the sample audio is formed by overlapping pure audio signals of at least two tracks, and a pure audio complex spectrum of each pure audio signal is acquired;
for the sample audio, obtaining estimated audio complex spectrums and estimated audio signals of the at least two tracks of the sample audio based on an audio separation model to be trained;
determining a loss value according to the pure audio complex spectrum, the estimated audio complex spectrum, the pure audio signal and the estimated audio signal of the at least two tracks;
and adjusting parameters of the audio separation model to be trained based on the loss value to obtain the audio separation model.
6. The audio separation method of claim 5 wherein determining a loss value based on the clean audio complex spectrum, the estimated audio complex spectrum, the clean audio signal, the estimated audio signal for the at least two tracks comprises:
determining a first loss value according to the pure audio complex spectrum and the pre-estimated audio complex spectrum of each track;
determining a second loss value according to the pure audio complex spectrum of each track and the estimated audio complex spectra of other tracks except the current track in the at least two tracks;
determining a third loss value according to the pure audio signal and the pre-estimated audio signal of each track;
and determining a total loss value according to the first loss value, the second loss value and the third loss value, wherein the total loss value is positively correlated with the first loss value, and the total loss value is negatively correlated with the second loss value and the third loss value.
7. An audio separating apparatus, comprising:
a rough separation unit configured to perform, for audio to be separated, obtaining a mixed audio complex spectrum of the audio to be separated and rough separation audio complex spectra of at least two tracks based on a rough separation network of an audio separation model;
a residual unit configured to perform a residual compensation network of the audio separation model on the coarsely divided audio complex spectrum and the mixed audio complex spectrum of the at least two tracks to obtain complex spectral residuals of the at least two tracks;
a compensation unit configured to perform determining an audio complex spectrum from the coarsely divided audio complex spectrum and the complex spectrum residual for each track;
a conversion unit configured to perform a conversion of the audio complex spectra of the at least two tracks into audio signals, respectively;
the rough separation network and the residual error compensation network both comprise a two-dimensional window self-attention network, the two-dimensional window self-attention network comprises a multi-head self-attention layer and a two-dimensional window self-attention layer which are connected in series, the multi-head self-attention layer and the two-dimensional window self-attention layer are respectively used for extracting a first intermediate feature and a second intermediate feature, the first intermediate feature and the second intermediate feature are both two-dimensional matrixes formed by a plurality of time-frequency units, and the two-dimensional window self-attention layer is used for extracting a three-dimensional feature from the first intermediate feature and extracting the second intermediate feature from the three-dimensional feature.
8. An electronic device, comprising:
at least one processor;
at least one memory storing computer-executable instructions,
wherein the computer-executable instructions, when executed by the at least one processor, cause the at least one processor to perform the audio separation method of any of claims 1 to 6.
9. A computer-readable storage medium, wherein instructions in the computer-readable storage medium, when executed by at least one processor, cause the at least one processor to perform the audio separation method of any of claims 1 to 6.
CN202211100895.7A 2022-09-09 2022-09-09 Audio separation method and device, electronic equipment and computer readable storage medium Pending CN115641868A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211100895.7A CN115641868A (en) 2022-09-09 2022-09-09 Audio separation method and device, electronic equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211100895.7A CN115641868A (en) 2022-09-09 2022-09-09 Audio separation method and device, electronic equipment and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN115641868A true CN115641868A (en) 2023-01-24

Family

ID=84941820

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211100895.7A Pending CN115641868A (en) 2022-09-09 2022-09-09 Audio separation method and device, electronic equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN115641868A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116403599A (en) * 2023-06-07 2023-07-07 中国海洋大学 Efficient voice separation method and model building method thereof
CN116403599B (en) * 2023-06-07 2023-08-15 中国海洋大学 Efficient voice separation method and model building method thereof

Similar Documents

Publication Publication Date Title
CN105206270B (en) A kind of isolated digit speech recognition categorizing system and method combining PCA and RBM
CN103189913B (en) Method, apparatus for decomposing a multichannel audio signal
CN112927707B (en) Training method and device for voice enhancement model and voice enhancement method and device
CN111309965B (en) Audio matching method, device, computer equipment and storage medium
CN109378010A (en) Neural network model training method, voice denoising method and device
Zhang Music feature extraction and classification algorithm based on deep learning
CN111754988A (en) Sound scene classification method based on attention mechanism and double-path depth residual error network
CN106294331A (en) Audio information retrieval method and device
CN109308912A (en) Music style recognition methods, device, computer equipment and storage medium
CN104240719B (en) The feature extracting method of audio, the sorting technique of audio and relevant apparatus
CN112562741A (en) Singing voice detection method based on dot product self-attention convolution neural network
CN113284507B (en) Training method and device for voice enhancement model and voice enhancement method and device
CN109300470B (en) Mixing separation method and mixing separation device
CN105761728A (en) Chinese typical hearing culture symbol characteristic selection method
CN113241088A (en) Training method and device of voice enhancement model and voice enhancement method and device
CN115641868A (en) Audio separation method and device, electronic equipment and computer readable storage medium
Hu et al. A lightweight multi-sensory field-based dual-feature fusion residual network for bird song recognition
Chadha et al. Optimal feature extraction and selection techniques for speech processing: A review
CN111445922B (en) Audio matching method, device, computer equipment and storage medium
CN111488486B (en) Electronic music classification method and system based on multi-sound-source separation
WO2023226572A1 (en) Feature representation extraction method and apparatus, device, medium and program product
Cui et al. Research on audio recognition based on the deep neural network in music teaching
CN113555031B (en) Training method and device of voice enhancement model, and voice enhancement method and device
CN116884435A (en) Voice event detection method and device based on audio prompt learning
Surampudi et al. Enhanced feature extraction approaches for detection of sound events

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination