CN113850246B - Method and system for sound source positioning and sound source separation based on dual coherent network - Google Patents


Info

Publication number
CN113850246B
CN113850246B CN202111441409.3A CN202111441409A
Authority
CN
China
Prior art keywords
sound source
audio
sound
network
positioning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111441409.3A
Other languages
Chinese (zh)
Other versions
CN113850246A (en)
Inventor
李昊沅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Yizhi Intelligent Technology Co ltd
Original Assignee
Hangzhou Yizhi Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Yizhi Intelligent Technology Co ltd filed Critical Hangzhou Yizhi Intelligent Technology Co ltd
Priority to CN202111441409.3A priority Critical patent/CN113850246B/en
Publication of CN113850246A publication Critical patent/CN113850246A/en
Application granted granted Critical
Publication of CN113850246B publication Critical patent/CN113850246B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformation in the plane of the image
    • G06T3/40Scaling the whole image or part thereof
    • G06T3/4038Scaling the whole image or part thereof for image mosaicing, i.e. plane images composed of plane sub-images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T9/00Image coding
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272Voice signal separating
    • G10L21/028Voice signal separating using properties of sound source

Abstract

The invention discloses a method and a system for sound source positioning and sound source separation based on a dual coherent network, belonging to the image-audio multimodal field. The method mainly comprises the following steps: 1) obtaining an audio and video data set, selecting a pair of videos belonging to different sound domains, extracting the corresponding single-source audio and image information, and computing the mixed audio; 2) feature-encoding the audio and the images respectively to obtain audio and image features; 3) feeding the mixed audio and the image features into the sound source separation module of the dual consistent network to separate the single-source audio; 4) feeding the images and the corresponding audio features into the sound source positioning module of the dual consistent network to obtain the sounding objects in the images. Compared with traditional methods for sound source positioning and sound source separation, the method treats the two tasks as dual tasks, completes them simultaneously with the same framework, and uses the characteristics of the two tasks to mutually enhance performance during training, finally improving the results on both tasks.

Description

Method and system for sound source positioning and sound source separation based on dual coherent network
Technical Field
The invention relates to the image-audio multimodal field, and in particular to a method for sound source positioning and sound source separation based on a dual coherent network.
Background
Vision and hearing are important ways for human beings to perceive the world: humans can identify and separate the sounds emitted by various objects and can find the sound-emitting objects in a complex scene. This strong perceptual ability is the basis for making subsequent complex decisions. Endowing machines with the ability to separate and localize sound sources is therefore a necessary step toward realizing artificial intelligence.
Much of the current research focuses on two separate tasks, namely sound source localization and visually guided sound separation. Although these methods have achieved some success, several problems remain unsolved:
1) Current visually guided sound separation models require a specific image to query the sound corresponding to an object in the image; when multiple objects exist in the image, the model cannot know which object the sound corresponds to, and performance is poor.
2) At present, most models handle only one of the two tasks, and the two tasks cannot be processed simultaneously by one framework; when audio needs to be positioned and separated simultaneously, the models are simply stacked together, which makes the overall model complex and the computation slow.
Disclosure of Invention
The invention provides a self-supervised dual consistent network that exploits the characteristics of the sound source positioning and sound source separation tasks, realizes both tasks within the same framework, and achieves the effect of mutual enhancement.
In order to achieve the purpose, the technical scheme adopted by the invention is as follows:
one of the objectives of the present invention is to provide a method for sound source localization and sound source separation based on dual coherent network, comprising the following steps:
1) acquiring an audio and video data set, randomly selecting a pair of videos containing different sound domains from the data set, extracting original audio and frame images in the videos, and constructing mixed audio and spliced images according to each pair of videos;
2) respectively encoding the original audio and the frame image, and the mixed audio and the spliced image;
3) performing sounding domain detection on the coded mixed audio features to obtain the detection results of the different sounding domains contained in the mixed audio;
4) constructing a dual consistent network comprising a sound source separation network and a sound source positioning network, taking the characteristics of mixed audio and coded spliced images as the input of the sound source separation network, separating audio corresponding to different sound domains from the mixed audio according to the detection results of the different sound domains, and calculating the separation loss;
the coded original audio and the frame image are used as the input of a sound source positioning network, a sound object is positioned from the frame image, and the matching loss is calculated;
5) performing end-to-end multi-task training on the dual coherent network, and keeping consistency before and after separation and consistency constraint before and after positioning in the training process; and realizing sound source positioning and sound source separation by using the trained dual consistent network.
Another object of the present invention is to provide a sound source localization and separation system for implementing the above method, comprising:
the data acquisition module is used for acquiring an audio and video data set, randomly selecting a pair of videos containing different sound domains from the data set, extracting original audio and frame images in the videos, and constructing mixed audio and spliced images according to each pair of videos;
an audio encoding module for encoding the original audio and the mixed audio;
the image coding module is used for coding the frame image and the spliced image;
the sounding domain detection module is used for carrying out sounding domain detection on the coded mixed audio features to obtain different sounding domain detection results contained in the mixed audio;
a sound source separation module: the system is used for separating the audios corresponding to different sound domains from the mixed audio according to the mixed audio, the characteristics of the coded spliced images and the detection results of the different sound domains;
the sound source positioning module: the method is used for positioning and obtaining the sounding object from the frame image according to the encoded original audio and the frame image.
And the multi-task training module is used for performing end-to-end multi-task training on the sounding domain detection module, the sound source separation module and the sound source positioning module, and keeping consistency before and after separation and consistency constraint before and after positioning in the training process.
Compared with the prior art, the invention has the following beneficial effects.
(1) The invention regards sound source positioning and sound source separation as dual tasks, so both tasks can be solved by the same simple framework with a better effect. Traditional schemes basically solve only one task at a time, and their models are complex and cannot simply be stacked together.
(2) The invention designs the dual consistent network by utilizing the characteristic of the dual task of sound source positioning and sound source separation, and can respectively enhance the positioning and separating performances by utilizing the separated audio and the positioned object, thereby achieving the effect of dual consistency and mutual promotion of the two tasks and obtaining better effect on the two tasks.
(3) The invention designs a sound-domain-based separation method in the sound source separation module: when separating audio, the separation results of all sound domains are predicted, whereas traditional methods predict the separation result for a given image query. This solves the problem that, when multiple objects exist in the image, the model cannot know which object the separated sound corresponds to, which degrades performance.
Drawings
Fig. 1 is a schematic diagram illustrating a method for sound source localization and sound source separation based on dual coherent networks according to an embodiment of the present invention.
Detailed Description
The invention will be further elucidated and described with reference to the drawings and the detailed description.
As shown in fig. 1, the method for sound source localization and sound source separation based on dual coherent network of the present invention mainly comprises the following steps.
Step one, acquiring an audio and video data set, randomly selecting a pair of videos containing different sound domains from the data set, extracting original audio and frame images in the videos, and constructing mixed audio and spliced images according to each pair of videos;
step two, respectively encoding the original audio and the frame image, and the mixed audio and the spliced image;
step three, performing sounding domain detection on the coded mixed audio features to obtain the detection results of the different sounding domains contained in the mixed audio;
step four, constructing a dual consistent network comprising a sound source separation network and a sound source positioning network, taking the characteristics of the mixed audio and the coded spliced image as the input of the sound source separation network, separating the audio corresponding to different sound domains from the mixed audio according to the detection results of the different sound domains, and calculating the separation loss;
the coded original audio and the frame image are used as the input of a sound source positioning network, a sound object is positioned from the frame image, and the matching loss is calculated;
step five, performing end-to-end multi-task training on the dual-coincidence network, and keeping consistency before and after separation and consistency constraint before and after positioning in the training process; and realizing sound source positioning and sound source separation by using the trained dual consistent network.
Step one is used for constructing a training set.
In this embodiment, a pair of videos Video_1 and Video_2 containing different sound domains (musical instruments) is randomly selected, and the corresponding audio A_1, A_2 and a certain frame image V_1, V_2 are randomly extracted.
The mixed audio is obtained by splicing audio of the same length, randomly extracted from the pair of videos, in the time dimension; in this embodiment the mixed audio A_12 = A_1 + A_2 is constructed from the two audio segments. The spliced image V_12 = [V_1, V_2] is obtained by resizing the two frame images corresponding to the two audio segments and splicing them in the horizontal direction.
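For illustration, a minimal PyTorch sketch of constructing the mixed audio and the spliced image is given below; the tensor names, the target frame size and the use of PyTorch are assumptions for illustration only, not part of the original disclosure.

```python
import torch
import torch.nn.functional as F

def build_training_pair(a1, a2, v1, v2, size=(224, 224)):
    """Construct the mixed audio A12 = A1 + A2 and the horizontally spliced image V12 = [V1, V2].

    a1, a2: single-source waveforms of equal length, shape (num_samples,)
    v1, v2: frame images, shape (3, H, W)
    """
    # Mixed audio: sample-wise sum of the two equal-length single-source clips.
    a12 = a1 + a2
    # Resize both frames to a common size, then concatenate along the width axis.
    v1r = F.interpolate(v1.unsqueeze(0), size=size, mode="bilinear", align_corners=False)
    v2r = F.interpolate(v2.unsqueeze(0), size=size, mode="bilinear", align_corners=False)
    v12 = torch.cat([v1r, v2r], dim=-1).squeeze(0)  # shape (3, size[0], 2 * size[1])
    return a12, v12
```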
Step two is used for coding the audio and the image.
In this embodiment, the encoding method for the original audio and the mixed audio is: firstly, carrying out short-time Fourier transform on original audio or mixed audio to be coded; and then, encoding the short-time Fourier transform result by using an audio encoder. The audio encoder can be realized by adopting the existing network such as ResNet.
The method for coding the original frame image and the spliced image comprises the following steps: the image is processed directly with an image encoder.
Step three performs sounding domain detection.
In this embodiment, the encoded mixed audio features are subjected to two-dimensional average pooling, then subjected to matrix conversion and activation function processing to obtain probabilities in each sound domain, and the two sound domains with the highest probabilities are used as prediction results, and parameters are updated by using a binary cross entropy loss function.
Step four executes the separation and positioning functions of the sound source separation network and the sound source positioning network.
A. The sound source separation network specifically comprises:
carrying out short-time Fourier transform on the mixed audio to obtain a frequency spectrum, and utilizing a segmentation network to segment the frequency spectrum of the mixed audio;
performing two-dimensional average pooling on the coded spliced image features, interacting a pooling result with an audio segmentation result, and performing matrix conversion and activation function processing to obtain a predicted spectrum mask;
and multiplying the predicted spectrum mask with the spectrum of the mixed audio, extracting the spectrum of the predicted sound domain according to the prediction result of the sound domain, and obtaining the audio which is separated from the mixed audio and corresponds to different sound domains through inverse short-time Fourier transform.
B. The sound source positioning network specifically comprises:
Firstly, the coded original audio features are max-pooled; positioning is then performed using the max-pooled result together with the coded frame image features: the probability of a sound-producing object corresponding to each feature point in the frame image features is calculated, and the original frame image area corresponding to the connected area of all feature points whose probability is greater than a threshold value is taken as the positioning result, so that the sound-producing object is located in the frame image.
In order to train the sound source separation network, the invention marks a real spectrum mask in the spectrum of the mixed audio according to the real sounding domains, and calculates a binary cross entropy loss between the predicted spectrum mask and the real spectrum mask to update the parameters.
In addition, the consistency before and after separation needs to be ensured in the training process of the sound source separation network, and the loss of the separation consistency is calculated:
loss_A = mean(| sum_C(mask_pred * S_12) - S_12 |)
in the formula, loss_A represents the separation consistency loss, mean(·) represents the mean, sum_C(·) represents summation over the domain dimension, mask_pred represents the predicted spectrum mask, S_12 represents the spectrum of the mixed audio obtained after short-time Fourier transform, and |·| represents the L1 norm.
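For illustration, a minimal PyTorch sketch of this separation consistency loss follows; the tensor shapes and names are assumptions.

```python
import torch

def separation_consistency_loss(mask_pred, s12):
    """loss_A = mean(| sum_C(mask_pred * S_12) - S_12 |).

    mask_pred: predicted spectrum masks, shape (C, F, T), one mask per sound domain
    s12:       magnitude spectrum of the mixed audio, shape (F, T)
    """
    # Apply each domain mask to the mixture spectrum and sum over the domain dimension C.
    reconstructed = (mask_pred * s12.unsqueeze(0)).sum(dim=0)
    # L1 discrepancy between the re-assembled spectrum and the original mixture, averaged.
    return (reconstructed - s12).abs().mean()
```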
In order to train the sound source positioning network, the invention matches the original audio features after the maximum pooling with the coded frame image features and calculates the matching loss loss_M; in the loss, mean(·) represents the mean, sum(·) represents the vector sum, and the inputs are the original audio features after the i-th pooling, i ∈ {1, 2}, and the i-th coded frame image features.
In addition, consistency before and after positioning also needs to be ensured in the training process of the sound source positioning network, and the positioning consistency loss loss_V is calculated; in the loss, |·| represents the L1 norm, and the inputs are the probability matrix of sounding objects corresponding to all feature points in the first frame image features, the probability matrix of sounding objects corresponding to all feature points in the second frame image features, and the probability matrix of sounding objects corresponding to all feature points in the spliced image features.
In this embodiment, the probability matrix of sounding objects is computed as follows: the pooled original audio features are multiplied with the corresponding coded frame image features, the multiplication results are summed over the feature dimension, and after activation function processing the probability matrix of sounding objects corresponding to all feature points in the frame image features is obtained.
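For illustration, a minimal PyTorch sketch of this probability-matrix computation follows; the tensor shapes and names are assumptions.

```python
import torch

def sounding_probability_map(audio_feat, image_feat):
    """Probability that a sounding object is present at each spatial feature point.

    audio_feat: pooled original audio feature vector, shape (d,)
    image_feat: encoded frame image features, shape (d, h, w)
    """
    # Multiply the audio feature with the image features, sum over the feature dimension d,
    # then squash the result to (0, 1) with a sigmoid activation.
    scores = (audio_feat[:, None, None] * image_feat).sum(dim=0)  # shape (h, w)
    return torch.sigmoid(scores)
```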
In one embodiment of the present invention, a training process for a dual coherent network based sound source localization and sound source separation method is described in detail. The method comprises the following specific steps.
1. A training data set is constructed.
Firstly, an audio and video data set is obtained, and a pair of videos Video_1 and Video_2 containing different sound domains (musical instruments) is randomly selected. About 6 seconds of audio A_1, A_2 are randomly taken out at a sampling rate of 11025 Hz, together with a frame image V_1, V_2 obtained after resizing. The mixed audio A_12 = A_1 + A_2 and the stitched image V_12, stitched in the horizontal direction, are constructed at the same time.
2. Feature coding.
For the original audio A_1, A_2 and the mixed audio A_12 obtained in step 1, a short-time Fourier transform (STFT) with a Hann window of size 1022 and a hop length of 256 is performed first, that is:
S_i = STFT(A_i)
When A_i = A_1, A_2 or A_12, the corresponding spectra S_1, S_2, S_12 are obtained. Feature coding is then carried out with an audio ResNet model, namely:
F_Si = ResNet18_audio(S_i)
When S_i = S_1, S_2 or S_12, the corresponding coded audio features F_S1, F_S2, F_S12 are obtained.
For the original frame images V_1, V_2 and the stitched image V_12 obtained in step 1, feature coding is performed with an image ResNet model pre-trained on ImageNet, namely:
F_Vi = ResNet18_image(V_i)
When V_i = V_1, V_2 or V_12, the corresponding coded image features F_V1, F_V2, F_V12 are obtained, where d is the dimension of the feature vector.
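For illustration, a rough PyTorch sketch of this encoding stage under the stated STFT parameters (Hann window of size 1022, hop length 256) and ResNet18 backbones is given below; the exact encoder configuration is not specified in the text, so the module definitions are assumptions.

```python
import torch
import torchvision

def stft_spectrum(audio, n_fft=1022, hop=256):
    """Magnitude spectrum via STFT with a Hann window of size 1022 and hop length 256."""
    window = torch.hann_window(n_fft)
    spec = torch.stft(audio, n_fft=n_fft, hop_length=hop, window=window, return_complex=True)
    return spec.abs()  # shape (n_fft // 2 + 1, num_frames)

# Illustrative encoders (assumed): a ResNet18 adapted to single-channel spectrograms for audio,
# and an ImageNet-pretrained ResNet18 without its classification head for images, so that a
# spatial feature map of dimension d per location is preserved for localization.
audio_encoder = torchvision.models.resnet18(num_classes=512)
audio_encoder.conv1 = torch.nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)

image_encoder = torch.nn.Sequential(
    *list(torchvision.models.resnet18(weights="IMAGENET1K_V1").children())[:-2]
)
```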
3. Sound source separation.
3.1 Sounding domain detection:
First, C sound domains (different instruments) are defined for the data used in the present invention. The coded mixed audio features F_S12 obtained in step 2 are transformed as
logit_field = sigmoid(W_field · AvgPool2D(F_S12) + b_field)
to obtain the probabilities logit_field over the respective sound domains, where · represents matrix multiplication, AvgPool2D represents two-dimensional average pooling, W_field is a learnable transformation matrix, b_field is the bias vector, and sigmoid(·) is the sigmoid function that scales the result to the (0, 1) interval. During training, the model parameters can be updated with a binary cross entropy loss against the ground-truth label, which is 1 for the domains that actually sound and 0 otherwise. At inference, the 2 entries of logit_field with the highest probabilities are taken directly as the domains a and b (ideally the domains in which A_1 and A_2 lie).
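For illustration, the detection head described above (two-dimensional average pooling, a learnable matrix with bias, a sigmoid, and binary cross entropy over the C sound domains) could be sketched in PyTorch as follows; the class and variable names, the feature dimension of 512 and the domain count of 11 (the number of instrument categories in the MUSIC experiments below) are assumptions.

```python
import torch
import torch.nn as nn

class SoundDomainDetector(nn.Module):
    """Predicts, for each of C sound domains, the probability that it is present in the mix."""

    def __init__(self, feat_dim, num_domains):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)           # two-dimensional average pooling
        self.proj = nn.Linear(feat_dim, num_domains)  # learnable transformation matrix + bias

    def forward(self, mixed_feat):
        # mixed_feat: encoded mixed-audio features, shape (B, d, h, w)
        pooled = self.pool(mixed_feat).flatten(1)     # shape (B, d)
        return torch.sigmoid(self.proj(pooled))       # per-domain probabilities in (0, 1)

detector = SoundDomainDetector(feat_dim=512, num_domains=11)
probs = detector(torch.randn(1, 512, 8, 8))
target = torch.zeros_like(probs)                      # placeholder ground-truth domain labels
loss = nn.functional.binary_cross_entropy(probs, target)
top2 = probs.topk(2, dim=1).indices                   # predicted domains a and b at inference
```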
3.2 The mixed audio spectrum S_12 obtained in step 2 is passed through a classical segmentation network Unet to obtain the segmentation features Unet(S_12). The coded spliced image features F_V12 obtained in step 2 are then transformed by two-dimensional average pooling and interact with the audio segmentation result, giving the prediction mask over the spectrum
mask_pred = sigmoid(W_mask · (AvgPool2D(F_V12) * Unet(S_12)) + b_mask)
where * denotes element-wise multiplication, W_mask is a learnable transformation matrix, and b_mask is a bias vector.
3.3 During training, the network parameters are updated using a binary cross entropy loss between the predicted mask mask_pred and the spectral mask over the real sounding domains.
3.4 The mask is then multiplied into the original mixed spectrum S_12, so the spectrum of each sound domain, mask_pred * S_12, can be obtained. According to the sounding domains a and b obtained in step 3.1, the spectra of the corresponding domains are taken out, and the correspondingly separated audio is obtained through the inverse short-time Fourier transform (ISTFT).
3.5 During training, consistency before and after separation needs to be ensured, and the following loss is applied:
loss_A = mean(| sum_C(mask_pred * S_12) - S_12 |)
where |·| represents the L1 norm, sum_C represents the sum over the domain dimension, and mean represents the average over the entire vector.
4. Sound source positioning.
4.1 For the image features F_V1, F_V2 and the audio features F_S1, F_S2 obtained in step 2, the audio features are max-pooled, and a matching loss is designed between the max-pooled audio features and the image features, where sum represents the summation over the entire vector and mean represents the averaging over the entire vector.
4.2 During positioning, the probability matrix of all feature points in the frame image features corresponding to sounding objects is calculated: the max-pooled audio features are multiplied with the coded frame image features, summed over the feature dimension (sum_d), and passed through the activation function, for i ∈ {1, 2}. The area whose probability is greater than the threshold is the area where the sound-producing object is located; in particular, the sounding object O_1 in V_1 is obtained, and likewise the sounding object O_2 in V_2.
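For illustration, the thresholding and connected-region step of 4.2 could be sketched as follows; the use of scipy.ndimage and the default threshold of 0.5 are assumptions, since the threshold value is not stated in the text.

```python
import numpy as np
from scipy import ndimage

def locate_sounding_region(prob_map, threshold=0.5):
    """Group above-threshold feature points into connected regions.

    prob_map: sounding-object probability matrix over feature points, shape (h, w)
    Returns the boolean mask of above-threshold points together with their connected-component
    labels; the corresponding area of the original frame image is the localization result.
    """
    above = prob_map > threshold
    labeled, num_regions = ndimage.label(above)  # connected-component labelling
    return above, labeled, num_regions
```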
4.3 Finally, the image consistency loss is constructed between the probability matrices of sounding objects computed on the two single-frame image features and the probability matrix computed on the spliced image features, where mean represents the average over the entire vector; the probability matrices are calculated according to the formula in 4.2.
5. In the training process, end-to-end multi-task training is performed on the dual consistent network by combining the above loss functions.
Compared with the traditional method in the tasks of sound source positioning and sound source separation, the method provided by the invention treats the two tasks as dual tasks, simultaneously completes the dual tasks by using the same framework, and mutually enhances the performance in the training process by utilizing the characteristics of the two tasks, thereby finally improving the effect on the two tasks.
Corresponding to the foregoing embodiments of a method for dual coherent network-based sound source localization and sound source separation, the present application further provides a system for dual coherent network-based sound source localization and sound source separation, which includes:
the data acquisition module is used for acquiring an audio and video data set, randomly selecting a pair of videos containing different sound domains from the data set, extracting original audio and frame images in the videos, and constructing mixed audio and spliced images according to each pair of videos;
an audio encoding module for encoding the original audio and the mixed audio;
the image coding module is used for coding the frame image and the spliced image;
the sounding domain detection module is used for carrying out sounding domain detection on the coded mixed audio features to obtain different sounding domain detection results contained in the mixed audio;
a sound source separation module: the system is used for separating the audios corresponding to different sound domains from the mixed audio according to the mixed audio, the characteristics of the coded spliced images and the detection results of the different sound domains;
the sound source positioning module: the system is used for positioning and obtaining a sounding object from a frame image according to an encoded original audio and the frame image;
and the multi-task training module is used for performing end-to-end multi-task training on the sounding domain detection module, the sound source separation module and the sound source positioning module, and keeping consistency before and after separation and consistency constraint before and after positioning in the training process.
For the system embodiment, since it basically corresponds to the method embodiment, reference may be made to the description of the method embodiment for the relevant points. The system embodiments described above are merely illustrative; modules such as the sound source separation module may or may not be physically separate. In addition, each functional module of the invention may be integrated into one processing unit, each module may exist alone physically, or two or more modules may be integrated into one unit. The integrated modules or units can be implemented in hardware or as software functional units, and part or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of the application.
The method is applied to the following embodiments to achieve the technical effects of the present invention, and detailed steps in the embodiments are not described again.
Examples
To further demonstrate the effectiveness of the present invention, experimental validation was performed on the MUSIC data set, which contains 685 untrimmed videos collected from YouTube, of which 536 are solo and 149 are duet videos. The videos cover 11 instrument categories: accordion, acoustic guitar, cello, clarinet, erhu, flute, trumpet, saxophone, violin, xylophone, which makes the data set suitable for the sound source separation and sound source localization tasks. For the sound source localization task, the intersection over union (IoU) and the area under the curve (AUC) are used as evaluation indexes. The visual localization methods SoP (Hang Zhao, Chuang Gan, Andrew Rouditchenko, Carl Vondrick, Josh H. McDermott, and Antonio Torralba. The sound of pixels. In ECCV, 2018) and DMC (Di Hu, Feiping Nie, and Xuelong Li. Deep multimodal clustering for unsupervised audiovisual learning.) are used as comparisons.
Table 1 Sound source localization experimental results
For the sound source separation task, the experiments take the signal-to-distortion ratio (SDR), the signal-to-interference ratio (SIR) and the signal-to-artifact ratio (SAR) as evaluation indexes. The SoP method (Hang Zhao, Chuang Gan, Andrew Rouditchenko, Carl Vondrick, Josh H. McDermott, and Antonio Torralba. The sound of pixels. In ECCV, 2018) is used as a comparison.
Table 2 Sound source separation experimental results
Tables 1 and 2 show the evaluation results of the invention. It can be seen that the results of the invention are superior to those of the other models, which indicates that the dual consistent network-based method has achieved a certain success: the framework not only completes the two tasks of sound source localization and sound source separation simultaneously, but also exploits the dual character of the two tasks to mutually enhance their performance during training through the dual consistency losses.
The foregoing lists merely illustrate specific embodiments of the invention. It is obvious that the invention is not limited to the above embodiments, but that many variations are possible. All modifications which can be derived or suggested by a person skilled in the art from the disclosure of the present invention are to be considered within the scope of the invention.

Claims (6)

1. A method for sound source localization and sound source separation based on dual coherent network is characterized by comprising the following steps:
1) acquiring an audio and video data set, randomly selecting a pair of videos containing different sound domains from the data set, extracting original audio and frame images in the videos, and constructing mixed audio and spliced images according to each pair of videos;
2) respectively encoding the original audio and the frame image, and the mixed audio and the spliced image;
3) performing sounding domain detection on the coded mixed audio features to obtain the detection results of the different sounding domains contained in the mixed audio;
4) constructing a dual consistent network comprising a sound source separation network and a sound source positioning network, taking the characteristics of mixed audio and coded spliced images as the input of the sound source separation network, separating audio corresponding to different sound domains from the mixed audio according to the detection results of the different sound domains, and calculating the separation loss;
the coded original audio and the frame image are used as the input of a sound source positioning network, a sound object is positioned from the frame image, and the matching loss is calculated;
the sound source separation network specifically comprises:
carrying out short-time Fourier transform on the mixed audio to obtain a frequency spectrum, and utilizing a segmentation network to segment the frequency spectrum of the mixed audio;
performing two-dimensional average pooling on the coded spliced image features, interacting a pooling result with an audio segmentation result, and performing matrix conversion and activation function processing to obtain a predicted spectrum mask;
multiplying the predicted spectrum mask with the spectrum of the mixed audio, extracting the spectrum of the predicted sound domain according to the prediction result of the sound domain, and obtaining the audio corresponding to different sound domains separated from the mixed audio through inverse short-time Fourier transform;
the sound source positioning network firstly performs maximum pooling on the coded original audio features, then performs positioning by using the max-pooled result and the coded frame image features, calculates the probability of a sounding object corresponding to each feature point in the frame image features, and takes the original frame image area corresponding to the connected area of all feature points with probability greater than a threshold value as the positioning result, so as to locate the sounding object in the frame image;
5) performing end-to-end multi-task training on the dual coherent network, and keeping consistency before and after separation and consistency constraint before and after positioning in the training process; realizing sound source positioning and sound source separation by using the trained dual consistent network;
the training process of the sound source separation network comprises the following steps:
marking a real spectrum mask in the spectrum of the mixed audio according to the real sounding domain, and calculating a binary cross entropy loss between the predicted spectrum mask and the real spectrum mask to update parameters;
the consistency before and after separation needs to be ensured in the training process of the sound source separation network, and the loss of the separation consistency is calculated as follows:
loss_A = mean(| sum_C(mask_pred * S_12) - S_12 |)
in the formula, loss_A represents the separation consistency loss, mean(·) represents the mean, sum_C(·) represents summation over the domain dimension, mask_pred represents the predicted spectrum mask, S_12 represents the spectrum obtained after short-time Fourier transform of the mixed audio, and |·| represents the L1 norm;
the training process of the sound source localization network comprises the following steps:
matching the original audio features after the maximum pooling with the encoded frame image features, and calculating the matching loss loss_M; in the loss, mean(·) represents the mean, sum(·) represents the vector sum, and the inputs are the original audio features after the i-th pooling, i ∈ {1, 2}, and the i-th coded frame image features;
the consistency before and after positioning needs to be ensured in the training process of the sound source positioning network, and the positioning consistency loss loss_V is calculated; in the loss, |·| represents the L1 norm, and the inputs are the probability matrix of sounding objects corresponding to all feature points in the first frame image features, the probability matrix of sounding objects corresponding to all feature points in the second frame image features, and the probability matrix of sounding objects corresponding to all feature points in the spliced image features.
2. The dual congruence network-based sound source localization and sound source separation method according to claim 1, wherein the mixed audio is obtained by splicing randomly extracted audio of the same length in a pair of videos in a time dimension; the spliced image is obtained by splicing the frame images corresponding to the two audio segments along the horizontal direction after the sizes of the frame images are changed.
3. The dual congruence network based sound source localization and sound source separation method of claim 1, wherein in step 2), the original audio and the mixed audio are encoded by:
carrying out short-time Fourier transform on original audio or mixed audio to be coded;
and encoding the short-time Fourier transform result by using an audio encoder.
4. The dual congruence network based sound source localization and sound source separation method according to claim 1, wherein the sounding domain detection specifically comprises:
and performing two-dimensional average pooling on the coded mixed audio features, performing matrix conversion and activation function processing to obtain the probability of each sound domain, taking the two sound domains with the maximum probability as prediction results, and updating parameters by using a binary cross entropy loss function.
5. The dual congruence network based sound source localization and sound source separation method of claim 1, wherein the probability matrix of sounding objects is computed as follows: the pooled original audio features are multiplied with the corresponding coded frame image features, the multiplication results are summed over the feature dimension, and after activation function processing the probability matrix of sounding objects corresponding to all feature points in the frame image features is obtained.
6. A dual coherent network based sound source localization and sound source separation system for implementing the sound source localization and sound source separation method of claim 1; the system for positioning and separating the sound source comprises:
the data acquisition module is used for acquiring an audio and video data set, randomly selecting a pair of videos containing different sound domains from the data set, extracting original audio and frame images in the videos, and constructing mixed audio and spliced images according to each pair of videos;
an audio encoding module for encoding the original audio and the mixed audio;
the image coding module is used for coding the frame image and the spliced image;
the sounding domain detection module is used for carrying out sounding domain detection on the coded mixed audio features to obtain different sounding domain detection results contained in the mixed audio;
a sound source separation module: the system is used for separating the audios corresponding to different sound domains from the mixed audio according to the mixed audio, the characteristics of the coded spliced images and the detection results of the different sound domains;
the sound source positioning module: the system is used for positioning and obtaining a sounding object from a frame image according to an encoded original audio and the frame image;
and the multi-task training module is used for performing end-to-end multi-task training on the sounding domain detection module, the sound source separation module and the sound source positioning module, and keeping consistency before and after separation and consistency constraint before and after positioning in the training process.
CN202111441409.3A 2021-11-30 2021-11-30 Method and system for sound source positioning and sound source separation based on dual coherent network Active CN113850246B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111441409.3A CN113850246B (en) 2021-11-30 2021-11-30 Method and system for sound source positioning and sound source separation based on dual coherent network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111441409.3A CN113850246B (en) 2021-11-30 2021-11-30 Method and system for sound source positioning and sound source separation based on dual coherent network

Publications (2)

Publication Number Publication Date
CN113850246A CN113850246A (en) 2021-12-28
CN113850246B true CN113850246B (en) 2022-02-18

Family

ID=78982562

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111441409.3A Active CN113850246B (en) 2021-11-30 2021-11-30 Method and system for sound source positioning and sound source separation based on dual coherent network

Country Status (1)

Country Link
CN (1) CN113850246B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114596876B (en) * 2022-01-21 2023-04-07 中国科学院自动化研究所 Sound source separation method and device
CN115174959B (en) * 2022-06-21 2024-01-30 咪咕文化科技有限公司 Video 3D sound effect setting method and device
CN115862682B (en) * 2023-01-03 2023-06-20 杭州觅睿科技股份有限公司 Sound detection method and related equipment
CN117475360B (en) * 2023-12-27 2024-03-26 南京纳实医学科技有限公司 Biological feature extraction and analysis method based on audio and video characteristics of improved MLSTM-FCN

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110970056A (en) * 2019-11-18 2020-04-07 清华大学 Method for separating sound source from video
CN112712819A (en) * 2020-12-23 2021-04-27 电子科技大学 Visual auxiliary cross-modal audio signal separation method

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3671739A1 (en) * 2018-12-21 2020-06-24 FRAUNHOFER-GESELLSCHAFT zur Förderung der angewandten Forschung e.V. Apparatus and method for source separation using an estimation and control of sound quality
US20210272573A1 (en) * 2020-02-29 2021-09-02 Robert Bosch Gmbh System for end-to-end speech separation using squeeze and excitation dilated convolutional neural networks
CN113674768A (en) * 2021-04-02 2021-11-19 深圳市微纳感知计算技术有限公司 Call-for-help detection method, device, equipment and storage medium based on acoustics

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110970056A (en) * 2019-11-18 2020-04-07 清华大学 Method for separating sound source from video
CN112712819A (en) * 2020-12-23 2021-04-27 电子科技大学 Visual auxiliary cross-modal audio signal separation method

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
Monophonic singing voice separation based on deep learning; Yutian; 2019 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR); 2019-04-25; 491-495 *
Streaming End-to-End Multi-Talker Speech Recognition; Liang Lu et al.; IEEE Signal Processing Letters; 2021-04-02; 803-807 *
Research on on-screen and off-screen speech separation algorithms based on multimodal fusion; 杨宇; China Master's Theses Full-text Database (Information Science and Technology); 2021-09-15; I136-74 *
Research on noise source localization based on audio-visual information fusion; 赵义鹏 et al.; Chinese Journal of Scientific Instrument; 2018-02-28; Vol. 39, No. 2; 89-99 *
End-to-end sound source separation: status, progress and future; 书哲_深蓝学院; https://www.jianshu.com/p/f47e5bee9949; 2020-08-14; 1-13 *

Also Published As

Publication number Publication date
CN113850246A (en) 2021-12-28

Similar Documents

Publication Publication Date Title
CN113850246B (en) Method and system for sound source positioning and sound source separation based on dual coherent network
Morgado et al. Self-supervised generation of spatial audio for 360 video
CN112071329B (en) Multi-person voice separation method and device, electronic equipment and storage medium
US20200402497A1 (en) Systems and Methods for Speech Generation
CN111539449B (en) Sound source separation and positioning method based on second-order fusion attention network model
CN108962229B (en) Single-channel and unsupervised target speaker voice extraction method
Parekh et al. Motion informed audio source separation
Slizovskaia et al. Conditioned source separation for musical instrument performances
CN112071330B (en) Audio data processing method and device and computer readable storage medium
Fan et al. Singing voice separation and pitch extraction from monaural polyphonic audio music via DNN and adaptive pitch tracking
Lu et al. Self-supervised audio spatialization with correspondence classifier
CN111653270B (en) Voice processing method and device, computer readable storage medium and electronic equipment
Dong et al. Clipsep: Learning text-queried sound separation with noisy unlabeled videos
Montesinos et al. Solos: A dataset for audio-visual music analysis
Osako et al. Supervised monaural source separation based on autoencoders
Zhu et al. Leveraging category information for single-frame visual sound source separation
Lai et al. RPCA-DRNN technique for monaural singing voice separation
Feng et al. SSLNet: A network for cross-modal sound source localization in visual scenes
CN115033734B (en) Audio data processing method and device, computer equipment and storage medium
Qiu et al. Self-Supervised Learning Based Phone-Fortified Speech Enhancement.
Ullrich et al. Music transcription with convolutional sequence-to-sequence models
Reddy et al. Audioslots: A slot-centric generative model for audio separation
Kitahara et al. Instrogram: A new musical instrument recognition technique without using onset detection nor f0 estimation
Ngo et al. Sound context classification based on joint learning model and multi-spectrogram features
WO2023002737A1 (en) A method and system for scene-a ware audio-video representation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant