CN113850246B - Method and system for sound source positioning and sound source separation based on dual coherent network - Google Patents


Info

Publication number
CN113850246B
CN113850246B CN202111441409.3A CN202111441409A
Authority
CN
China
Prior art keywords
sound source
audio
sound
network
positioning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111441409.3A
Other languages
Chinese (zh)
Other versions
CN113850246A (en)
Inventor
李昊沅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Yizhi Intelligent Technology Co ltd
Original Assignee
Hangzhou Yizhi Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Yizhi Intelligent Technology Co ltd filed Critical Hangzhou Yizhi Intelligent Technology Co ltd
Priority to CN202111441409.3A priority Critical patent/CN113850246B/en
Publication of CN113850246A publication Critical patent/CN113850246A/en
Application granted granted Critical
Publication of CN113850246B publication Critical patent/CN113850246B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformation in the plane of the image
    • G06T3/40Scaling the whole image or part thereof
    • G06T3/4038Scaling the whole image or part thereof for image mosaicing, i.e. plane images composed of plane sub-images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T9/00Image coding
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272Voice signal separating
    • G10L21/028Voice signal separating using properties of sound source

Abstract

The invention discloses a method and a system for sound source positioning and sound source separation based on a dual coherent network, belonging to the image-audio multimodal field. The method mainly comprises the following steps: 1) obtaining an audio and video data set, selecting a pair of videos belonging to different sound domains, extracting the corresponding single-source audio and image information, and computing the mixed audio; 2) feature-encoding the audio and the images respectively to obtain audio and image features; 3) feeding the mixed audio and the image features into the sound source separation module of the dual consistent network to separate the single-source audio; 4) feeding the images and the corresponding audio features into the sound source positioning module of the dual consistent network to obtain the sounding objects in the images. Compared with traditional methods for sound source positioning and sound source separation, the method treats the two tasks as dual tasks, completes them simultaneously with the same framework, and uses the characteristics of the two tasks to mutually enhance performance during training, finally improving the results on both tasks.

Description

Method and system for sound source positioning and sound source separation based on dual coherent network
Technical Field
The invention relates to the image-audio multimodal field, and in particular to a method for sound source positioning and sound source separation based on a dual coherent network.
Background
Vision and hearing are important ways for human beings to perceive the world: humans can identify and separate the sounds emitted by various objects and can find the sound-emitting objects in a complex scene. This strong perceptual ability is the basis for making subsequent complex decisions. Endowing machines with the ability to separate and localize sound sources is therefore a necessary step toward realizing artificial intelligence.
Much of the current research focuses on two separate tasks, namely sound source localization and visually guided sound separation. Although these methods have achieved some success, several problems remain unsolved:
1) Current visually guided sound separation models require a specific image to query the sound corresponding to an object in the image; when multiple objects exist in the image, the model cannot know which object the sound corresponds to, and performance is poor.
2) At present, most models handle only one of the two tasks, and the two tasks cannot be processed simultaneously by one framework; when audio needs to be positioned and separated simultaneously, the models are simply stacked together, which makes the overall model complex and the computation slow.
Disclosure of Invention
The invention provides a self-supervised dual consistent network that exploits the characteristics of the sound source positioning and sound source separation tasks, realizes both tasks within the same framework, and achieves the effect of mutual enhancement.
In order to achieve the purpose, the technical scheme adopted by the invention is as follows:
one of the objectives of the present invention is to provide a method for sound source localization and sound source separation based on dual coherent network, comprising the following steps:
1) acquiring an audio and video data set, randomly selecting a pair of videos containing different sound domains from the data set, extracting original audio and frame images in the videos, and constructing mixed audio and spliced images according to each pair of videos;
2) respectively encoding the original audio and the frame image, and the mixed audio and the spliced image;
3) performing sounding domain detection on the coded mixed audio features to obtain the detection results of the different sounding domains contained in the mixed audio;
4) constructing a dual consistent network comprising a sound source separation network and a sound source positioning network, taking the characteristics of mixed audio and coded spliced images as the input of the sound source separation network, separating audio corresponding to different sound domains from the mixed audio according to the detection results of the different sound domains, and calculating the separation loss;
the coded original audio and the frame image are used as the input of a sound source positioning network, a sound object is positioned from the frame image, and the matching loss is calculated;
5) performing end-to-end multi-task training on the dual coherent network, and keeping consistency before and after separation and consistency constraint before and after positioning in the training process; and realizing sound source positioning and sound source separation by using the trained dual consistent network.
Another object of the present invention is to provide a sound source localization and separation system for implementing the above method, comprising:
the data acquisition module is used for acquiring an audio and video data set, randomly selecting a pair of videos containing different sound domains from the data set, extracting original audio and frame images in the videos, and constructing mixed audio and spliced images according to each pair of videos;
an audio encoding module for encoding the original audio and the mixed audio;
the image coding module is used for coding the frame image and the spliced image;
the sounding domain detection module is used for carrying out sounding domain detection on the coded mixed audio features to obtain different sounding domain detection results contained in the mixed audio;
a sound source separation module: the system is used for separating the audios corresponding to different sound domains from the mixed audio according to the mixed audio, the characteristics of the coded spliced images and the detection results of the different sound domains;
the sound source positioning module: the method is used for positioning and obtaining the sounding object from the frame image according to the encoded original audio and the frame image.
And the multi-task training module is used for performing end-to-end multi-task training on the sounding domain detection module, the sound source separation module and the sound source positioning module, and keeping consistency before and after separation and consistency constraint before and after positioning in the training process.
Compared with the prior art, the invention has the following beneficial effects.
(1) The invention regards sound source positioning and sound source separation as dual tasks, so both tasks can be solved by the same simple framework with a better effect. Traditional schemes basically solve only one task at a time, and their models are complex and cannot simply be stacked together.
(2) The invention designs the dual consistent network by utilizing the characteristic of the dual task of sound source positioning and sound source separation, and can respectively enhance the positioning and separating performances by utilizing the separated audio and the positioned object, thereby achieving the effect of dual consistency and mutual promotion of the two tasks and obtaining better effect on the two tasks.
(3) The invention designs a sound-domain-based separation method in the sound source separation module: when separating audio, the separation results of all sound domains are predicted, whereas traditional methods predict the separation result for a given image query. This solves the problem that, when multiple objects exist in the image, the model cannot know which object the separated sound corresponds to, which degrades performance.
Drawings
Fig. 1 is a schematic diagram illustrating a method for sound source localization and sound source separation based on dual coherent networks according to an embodiment of the present invention.
Detailed Description
The invention will be further elucidated and described with reference to the drawings and the detailed description.
As shown in fig. 1, the method for sound source localization and sound source separation based on dual coherent network of the present invention mainly comprises the following steps.
Step one, acquiring an audio and video data set, randomly selecting a pair of videos containing different sound domains from the data set, extracting original audio and frame images in the videos, and constructing mixed audio and spliced images according to each pair of videos;
step two, respectively encoding the original audio and the frame image, and the mixed audio and the spliced image;
step three, performing sounding domain detection on the coded mixed audio features to obtain the detection results of the different sounding domains contained in the mixed audio;
step four, constructing a dual consistent network comprising a sound source separation network and a sound source positioning network, taking the characteristics of the mixed audio and the coded spliced image as the input of the sound source separation network, separating the audio corresponding to different sound domains from the mixed audio according to the detection results of the different sound domains, and calculating the separation loss;
the coded original audio and the frame image are used as the input of a sound source positioning network, a sound object is positioned from the frame image, and the matching loss is calculated;
step five, performing end-to-end multi-task training on the dual-coincidence network, and keeping consistency before and after separation and consistency constraint before and after positioning in the training process; and realizing sound source positioning and sound source separation by using the trained dual consistent network.
Step one is used for constructing a training set.
In this embodiment, a pair of videos Video_1 and Video_2 containing different sound domains (musical instruments) is randomly selected, and the corresponding audio A_1, A_2 and a certain frame image V_1, V_2 are randomly extracted.
The mixed audio is obtained by splicing audio of the same length, randomly extracted from the pair of videos, in the time dimension; in this embodiment the mixed audio A_12 = A_1 + A_2 is constructed from the two audio segments. The spliced image V_12 = [V_1, V_2] is obtained by resizing the two frame images corresponding to the two audio segments and splicing them in the horizontal direction.
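For illustration, a minimal PyTorch sketch of constructing the mixed audio and the spliced image is given below; the tensor names, the target frame size and the use of PyTorch are assumptions for illustration only, not part of the original disclosure.

```python
import torch
import torch.nn.functional as F

def build_training_pair(a1, a2, v1, v2, size=(224, 224)):
    """Construct the mixed audio A12 = A1 + A2 and the horizontally spliced image V12 = [V1, V2].

    a1, a2: single-source waveforms of equal length, shape (num_samples,)
    v1, v2: frame images, shape (3, H, W)
    """
    # Mixed audio: sample-wise sum of the two equal-length single-source clips.
    a12 = a1 + a2
    # Resize both frames to a common size, then concatenate along the width axis.
    v1r = F.interpolate(v1.unsqueeze(0), size=size, mode="bilinear", align_corners=False)
    v2r = F.interpolate(v2.unsqueeze(0), size=size, mode="bilinear", align_corners=False)
    v12 = torch.cat([v1r, v2r], dim=-1).squeeze(0)  # shape (3, size[0], 2 * size[1])
    return a12, v12
```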
Step two is used for coding the audio and the image.
In this embodiment, the encoding method for the original audio and the mixed audio is: firstly, carrying out short-time Fourier transform on original audio or mixed audio to be coded; and then, encoding the short-time Fourier transform result by using an audio encoder. The audio encoder can be realized by adopting the existing network such as ResNet.
The method for coding the original frame image and the spliced image comprises the following steps: the image is processed directly with an image encoder.
Step three performs sounding domain detection.
In this embodiment, the encoded mixed audio features are subjected to two-dimensional average pooling, then subjected to matrix conversion and activation function processing to obtain probabilities in each sound domain, and the two sound domains with the highest probabilities are used as prediction results, and parameters are updated by using a binary cross entropy loss function.
Step four executes the separation and positioning functions of the sound source separation network and the sound source positioning network.
A. The sound source separation network specifically comprises:
carrying out short-time Fourier transform on the mixed audio to obtain a frequency spectrum, and utilizing a segmentation network to segment the frequency spectrum of the mixed audio;
performing two-dimensional average pooling on the coded spliced image features, interacting a pooling result with an audio segmentation result, and performing matrix conversion and activation function processing to obtain a predicted spectrum mask;
and multiplying the predicted spectrum mask with the spectrum of the mixed audio, extracting the spectrum of the predicted sound domain according to the prediction result of the sound domain, and obtaining the audio which is separated from the mixed audio and corresponds to different sound domains through inverse short-time Fourier transform.
B. The sound source positioning network specifically comprises:
Firstly, the coded original audio features are max-pooled; positioning is then performed using the max-pooled result together with the coded frame image features: the probability of a sound-producing object corresponding to each feature point in the frame image features is calculated, and the original frame image area corresponding to the connected area of all feature points whose probability is greater than a threshold value is taken as the positioning result, so that the sound-producing object is located in the frame image.
In order to train the sound source separation network, the invention marks a real spectrum mask in the spectrum of the mixed audio according to the real sounding domains, and calculates a binary cross entropy loss between the predicted spectrum mask and the real spectrum mask to update the parameters.
In addition, the consistency before and after separation needs to be ensured in the training process of the sound source separation network, and the loss of the separation consistency is calculated:
loss_A = mean(| sum_C(mask_pred * S_12) - S_12 |)
in the formula, loss_A represents the separation consistency loss, mean(·) represents the mean, sum_C(·) represents summation over the domain dimension, mask_pred represents the predicted spectrum mask, S_12 represents the spectrum of the mixed audio obtained after short-time Fourier transform, and |·| represents the L1 norm.
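For illustration, a minimal PyTorch sketch of this separation consistency loss follows; the tensor shapes and names are assumptions.

```python
import torch

def separation_consistency_loss(mask_pred, s12):
    """loss_A = mean(| sum_C(mask_pred * S_12) - S_12 |).

    mask_pred: predicted spectrum masks, shape (C, F, T), one mask per sound domain
    s12:       magnitude spectrum of the mixed audio, shape (F, T)
    """
    # Apply each domain mask to the mixture spectrum and sum over the domain dimension C.
    reconstructed = (mask_pred * s12.unsqueeze(0)).sum(dim=0)
    # L1 discrepancy between the re-assembled spectrum and the original mixture, averaged.
    return (reconstructed - s12).abs().mean()
```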
In order to train the sound source positioning network, the invention matches the original audio features after the maximum pooling with the coded frame image features and calculates the matching loss loss_M; in the loss, mean(·) represents the mean, sum(·) represents the vector sum, and the inputs are the original audio features after the i-th pooling, i ∈ {1, 2}, and the i-th coded frame image features.
In addition, consistency before and after positioning also needs to be ensured in the training process of the sound source positioning network, and the positioning consistency loss loss_V is calculated; in the loss, |·| represents the L1 norm, and the inputs are the probability matrix of sounding objects corresponding to all feature points in the first frame image features, the probability matrix of sounding objects corresponding to all feature points in the second frame image features, and the probability matrix of sounding objects corresponding to all feature points in the spliced image features.
In this embodiment, the probability matrix of sounding objects is computed as follows: the pooled original audio features are multiplied with the corresponding coded frame image features, the multiplication results are summed over the feature dimension, and after activation function processing the probability matrix of sounding objects corresponding to all feature points in the frame image features is obtained.
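For illustration, a minimal PyTorch sketch of this probability-matrix computation follows; the tensor shapes and names are assumptions.

```python
import torch

def sounding_probability_map(audio_feat, image_feat):
    """Probability that a sounding object is present at each spatial feature point.

    audio_feat: pooled original audio feature vector, shape (d,)
    image_feat: encoded frame image features, shape (d, h, w)
    """
    # Multiply the audio feature with the image features, sum over the feature dimension d,
    # then squash the result to (0, 1) with a sigmoid activation.
    scores = (audio_feat[:, None, None] * image_feat).sum(dim=0)  # shape (h, w)
    return torch.sigmoid(scores)
```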
In one embodiment of the present invention, a training process for a dual coherent network based sound source localization and sound source separation method is described in detail. The method comprises the following specific steps.
1. A training data set is constructed.
Firstly, an audio and video data set is obtained, and a pair of videos Video_1 and Video_2 containing different sound domains (musical instruments) is randomly selected. About 6 seconds of audio A_1, A_2 are randomly taken out at a sampling rate of 11025 Hz, together with a frame image V_1, V_2 obtained after resizing. The mixed audio A_12 = A_1 + A_2 and the stitched image V_12, stitched in the horizontal direction, are constructed at the same time.
2. Feature coding.
For the original audio A_1, A_2 and the mixed audio A_12 obtained in step 1, a short-time Fourier transform (STFT) with a Hann window of size 1022 and a hop length of 256 is performed first, that is:
S_i = STFT(A_i)
When A_i = A_1, A_2 or A_12, the corresponding spectra S_1, S_2, S_12 are obtained. Feature coding is then carried out with an audio ResNet model, namely:
F_Si = ResNet18_audio(S_i)
When S_i = S_1, S_2 or S_12, the corresponding coded audio features F_S1, F_S2, F_S12 are obtained.
For the original frame images V_1, V_2 and the stitched image V_12 obtained in step 1, feature coding is performed with an image ResNet model pre-trained on ImageNet, namely:
F_Vi = ResNet18_image(V_i)
When V_i = V_1, V_2 or V_12, the corresponding coded image features F_V1, F_V2, F_V12 are obtained, where d is the dimension of the feature vector.
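For illustration, a rough PyTorch sketch of this encoding stage under the stated STFT parameters (Hann window of size 1022, hop length 256) and ResNet18 backbones is given below; the exact encoder configuration is not specified in the text, so the module definitions are assumptions.

```python
import torch
import torchvision

def stft_spectrum(audio, n_fft=1022, hop=256):
    """Magnitude spectrum via STFT with a Hann window of size 1022 and hop length 256."""
    window = torch.hann_window(n_fft)
    spec = torch.stft(audio, n_fft=n_fft, hop_length=hop, window=window, return_complex=True)
    return spec.abs()  # shape (n_fft // 2 + 1, num_frames)

# Illustrative encoders (assumed): a ResNet18 adapted to single-channel spectrograms for audio,
# and an ImageNet-pretrained ResNet18 without its classification head for images, so that a
# spatial feature map of dimension d per location is preserved for localization.
audio_encoder = torchvision.models.resnet18(num_classes=512)
audio_encoder.conv1 = torch.nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)

image_encoder = torch.nn.Sequential(
    *list(torchvision.models.resnet18(weights="IMAGENET1K_V1").children())[:-2]
)
```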
3. Sound source separation.
3.1 Sounding domain detection:
First, C sound domains (different instruments) are defined for the data used in the present invention. The coded mixed audio features F_S12 obtained in step 2 are transformed as
logit_field = sigmoid(W_field · AvgPool2D(F_S12) + b_field)
to obtain the probabilities logit_field over the respective sound domains, where · represents matrix multiplication, AvgPool2D represents two-dimensional average pooling, W_field is a learnable transformation matrix, b_field is the bias vector, and sigmoid(·) is the sigmoid function that scales the result to the (0, 1) interval. During training, the model parameters can be updated with a binary cross entropy loss against the ground-truth label, which is 1 for the domains that actually sound and 0 otherwise. At inference, the 2 entries of logit_field with the highest probabilities are taken directly as the domains a and b (ideally the domains in which A_1 and A_2 lie).
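For illustration, the detection head described above (two-dimensional average pooling, a learnable matrix with bias, a sigmoid, and binary cross entropy over the C sound domains) could be sketched in PyTorch as follows; the class and variable names, the feature dimension of 512 and the domain count of 11 (the number of instrument categories in the MUSIC experiments below) are assumptions.

```python
import torch
import torch.nn as nn

class SoundDomainDetector(nn.Module):
    """Predicts, for each of C sound domains, the probability that it is present in the mix."""

    def __init__(self, feat_dim, num_domains):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)           # two-dimensional average pooling
        self.proj = nn.Linear(feat_dim, num_domains)  # learnable transformation matrix + bias

    def forward(self, mixed_feat):
        # mixed_feat: encoded mixed-audio features, shape (B, d, h, w)
        pooled = self.pool(mixed_feat).flatten(1)     # shape (B, d)
        return torch.sigmoid(self.proj(pooled))       # per-domain probabilities in (0, 1)

detector = SoundDomainDetector(feat_dim=512, num_domains=11)
probs = detector(torch.randn(1, 512, 8, 8))
target = torch.zeros_like(probs)                      # placeholder ground-truth domain labels
loss = nn.functional.binary_cross_entropy(probs, target)
top2 = probs.topk(2, dim=1).indices                   # predicted domains a and b at inference
```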
3.2 The mixed audio spectrum S_12 obtained in step 2 is passed through a classical segmentation network Unet to obtain the segmentation features Unet(S_12). The coded spliced image features F_V12 obtained in step 2 are then transformed by two-dimensional average pooling and interact with the audio segmentation result, giving the prediction mask over the spectrum
mask_pred = sigmoid(W_mask · (AvgPool2D(F_V12) * Unet(S_12)) + b_mask)
where * denotes element-wise multiplication, W_mask is a learnable transformation matrix, and b_mask is a bias vector.
3.3 During training, the network parameters are updated using a binary cross entropy loss between the predicted mask mask_pred and the spectral mask over the real sounding domains.
3.4 The mask is then multiplied into the original mixed spectrum S_12, so the spectrum of each sound domain, mask_pred * S_12, can be obtained. According to the sounding domains a and b obtained in step 3.1, the spectra of the corresponding domains are taken out, and the correspondingly separated audio is obtained through the inverse short-time Fourier transform (ISTFT).
3.5 During training, consistency before and after separation needs to be ensured, and the following loss is applied:
loss_A = mean(| sum_C(mask_pred * S_12) - S_12 |)
where |·| represents the L1 norm, sum_C represents the sum over the domain dimension, and mean represents the average over the entire vector.
4. Sound source positioning.
4.1 For the image features F_V1, F_V2 and the audio features F_S1, F_S2 obtained in step 2, the audio features are max-pooled, and a matching loss is designed between the max-pooled audio features and the image features, where sum represents the summation over the entire vector and mean represents the averaging over the entire vector.
4.2 During positioning, the probability matrix of all feature points in the frame image features corresponding to sounding objects is calculated: the max-pooled audio features are multiplied with the coded frame image features, summed over the feature dimension (sum_d), and passed through the activation function, for i ∈ {1, 2}. The area whose probability is greater than the threshold is the area where the sound-producing object is located; in particular, the sounding object O_1 in V_1 is obtained, and likewise the sounding object O_2 in V_2.
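For illustration, the thresholding and connected-region step of 4.2 could be sketched as follows; the use of scipy.ndimage and the default threshold of 0.5 are assumptions, since the threshold value is not stated in the text.

```python
import numpy as np
from scipy import ndimage

def locate_sounding_region(prob_map, threshold=0.5):
    """Group above-threshold feature points into connected regions.

    prob_map: sounding-object probability matrix over feature points, shape (h, w)
    Returns the boolean mask of above-threshold points together with their connected-component
    labels; the corresponding area of the original frame image is the localization result.
    """
    above = prob_map > threshold
    labeled, num_regions = ndimage.label(above)  # connected-component labelling
    return above, labeled, num_regions
```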
4.3 Finally, the image consistency loss is constructed between the probability matrices of sounding objects computed on the two single-frame image features and the probability matrix computed on the spliced image features, where mean represents the average over the entire vector; the probability matrices are calculated according to the formula in 4.2.
5. In the training process, end-to-end multi-task training is performed on the dual consistent network by combining the above loss functions.
Compared with the traditional method in the tasks of sound source positioning and sound source separation, the method provided by the invention treats the two tasks as dual tasks, simultaneously completes the dual tasks by using the same framework, and mutually enhances the performance in the training process by utilizing the characteristics of the two tasks, thereby finally improving the effect on the two tasks.
Corresponding to the foregoing embodiments of a method for dual coherent network-based sound source localization and sound source separation, the present application further provides a system for dual coherent network-based sound source localization and sound source separation, which includes:
the data acquisition module is used for acquiring an audio and video data set, randomly selecting a pair of videos containing different sound domains from the data set, extracting original audio and frame images in the videos, and constructing mixed audio and spliced images according to each pair of videos;
an audio encoding module for encoding the original audio and the mixed audio;
the image coding module is used for coding the frame image and the spliced image;
the sounding domain detection module is used for carrying out sounding domain detection on the coded mixed audio features to obtain different sounding domain detection results contained in the mixed audio;
a sound source separation module: the system is used for separating the audios corresponding to different sound domains from the mixed audio according to the mixed audio, the characteristics of the coded spliced images and the detection results of the different sound domains;
the sound source positioning module: the system is used for positioning and obtaining a sounding object from a frame image according to an encoded original audio and the frame image;
and the multi-task training module is used for performing end-to-end multi-task training on the sounding domain detection module, the sound source separation module and the sound source positioning module, and keeping consistency before and after separation and consistency constraint before and after positioning in the training process.
For the system embodiment, since it basically corresponds to the method embodiment, reference may be made to the description of the method embodiment for the relevant points. The system embodiments described above are merely illustrative; modules such as the sound source separation module may or may not be physically separate. In addition, each functional module of the invention may be integrated into one processing unit, each module may exist alone physically, or two or more modules may be integrated into one unit. The integrated modules or units can be implemented in hardware or as software functional units, and part or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of the application.
The method is applied to the following embodiments to achieve the technical effects of the present invention, and detailed steps in the embodiments are not described again.
Examples
To further demonstrate the effectiveness of the present invention, experimental validation was performed on the MUSIC data set, which contains 685 untrimmed videos collected from YouTube, of which 536 are solo and 149 are duet videos. The videos cover 11 instrument categories: accordion, acoustic guitar, cello, clarinet, erhu, flute, trumpet, saxophone, violin, xylophone, which makes the data set suitable for the sound source separation and sound source localization tasks. For the sound source localization task, the intersection over union (IoU) and the area under the curve (AUC) are used as evaluation indexes. The visual localization methods SoP (Hang Zhao, Chuang Gan, Andrew Rouditchenko, Carl Vondrick, Josh H. McDermott, and Antonio Torralba. The sound of pixels. In ECCV, 2018) and DMC (Di Hu, Feiping Nie, and Xuelong Li. Deep multimodal clustering for unsupervised audiovisual learning.) are used as comparisons.
Table 1 Sound source localization experimental results
For the sound source separation task, the experiments take the signal-to-distortion ratio (SDR), the signal-to-interference ratio (SIR) and the signal-to-artifact ratio (SAR) as evaluation indexes. The SoP method (Hang Zhao, Chuang Gan, Andrew Rouditchenko, Carl Vondrick, Josh H. McDermott, and Antonio Torralba. The sound of pixels. In ECCV, 2018) is used as a comparison.
Table 2 Sound source separation experimental results
Tables 1 and 2 show the evaluation results of the invention. It can be seen that the results of the invention are superior to those of the other models, which indicates that the dual consistent network-based method has achieved a certain success: the framework not only completes the two tasks of sound source localization and sound source separation simultaneously, but also exploits the dual character of the two tasks to mutually enhance their performance during training through the dual consistency losses.
The foregoing lists merely illustrate specific embodiments of the invention. It is obvious that the invention is not limited to the above embodiments, but that many variations are possible. All modifications which can be derived or suggested by a person skilled in the art from the disclosure of the present invention are to be considered within the scope of the invention.

Claims (6)

1. A method for sound source localization and sound source separation based on dual coherent network is characterized by comprising the following steps:
1) acquiring an audio and video data set, randomly selecting a pair of videos containing different sound domains from the data set, extracting original audio and frame images in the videos, and constructing mixed audio and spliced images according to each pair of videos;
2) respectively encoding the original audio and the frame image, and the mixed audio and the spliced image;
3) performing sounding domain detection on the coded mixed audio features to obtain the detection results of the different sounding domains contained in the mixed audio;
4) constructing a dual consistent network comprising a sound source separation network and a sound source positioning network, taking the characteristics of mixed audio and coded spliced images as the input of the sound source separation network, separating audio corresponding to different sound domains from the mixed audio according to the detection results of the different sound domains, and calculating the separation loss;
the coded original audio and the frame image are used as the input of a sound source positioning network, a sound object is positioned from the frame image, and the matching loss is calculated;
the sound source separation network specifically comprises:
carrying out short-time Fourier transform on the mixed audio to obtain a frequency spectrum, and utilizing a segmentation network to segment the frequency spectrum of the mixed audio;
performing two-dimensional average pooling on the coded spliced image features, interacting a pooling result with an audio segmentation result, and performing matrix conversion and activation function processing to obtain a predicted spectrum mask;
multiplying the predicted spectrum mask with the spectrum of the mixed audio, extracting the spectrum of the predicted sound domain according to the prediction result of the sound domain, and obtaining the audio corresponding to different sound domains separated from the mixed audio through inverse short-time Fourier transform;
the sound source positioning network firstly performs maximum pooling on the coded original audio features, then performs positioning by using the max-pooled result and the coded frame image features, calculates the probability of a sounding object corresponding to each feature point in the frame image features, and takes the original frame image area corresponding to the connected area of all feature points with probability greater than a threshold value as the positioning result, so as to locate the sounding object in the frame image;
5) performing end-to-end multi-task training on the dual coherent network, and keeping consistency before and after separation and consistency constraint before and after positioning in the training process; realizing sound source positioning and sound source separation by using the trained dual consistent network;
the training process of the sound source separation network comprises the following steps:
marking a real spectrum mask in the spectrum of the mixed audio according to the real sounding domain, and calculating a binary cross entropy loss between the predicted spectrum mask and the real spectrum mask to update parameters;
the consistency before and after separation needs to be ensured in the training process of the sound source separation network, and the loss of the separation consistency is calculated as follows:
loss_A = mean(| sum_C(mask_pred * S_12) - S_12 |)
in the formula, loss_A represents the separation consistency loss, mean(·) represents the mean, sum_C(·) represents summation over the domain dimension, mask_pred represents the predicted spectrum mask, S_12 represents the spectrum obtained after short-time Fourier transform of the mixed audio, and |·| represents the L1 norm;
the training process of the sound source localization network comprises the following steps:
matching the original audio features after the maximum pooling with the encoded frame image features, and calculating the matching loss loss_M; in the loss, mean(·) represents the mean, sum(·) represents the vector sum, and the inputs are the original audio features after the i-th pooling, i ∈ {1, 2}, and the i-th coded frame image features;
the consistency before and after positioning needs to be ensured in the training process of the sound source positioning network, and the positioning consistency loss loss_V is calculated; in the loss, |·| represents the L1 norm, and the inputs are the probability matrix of sounding objects corresponding to all feature points in the first frame image features, the probability matrix of sounding objects corresponding to all feature points in the second frame image features, and the probability matrix of sounding objects corresponding to all feature points in the spliced image features.
2. The dual congruence network-based sound source localization and sound source separation method according to claim 1, wherein the mixed audio is obtained by splicing randomly extracted audio of the same length in a pair of videos in a time dimension; the spliced image is obtained by splicing the frame images corresponding to the two audio segments along the horizontal direction after the sizes of the frame images are changed.
3. The dual congruence network based sound source localization and sound source separation method of claim 1, wherein in step 2), the original audio and the mixed audio are encoded by:
carrying out short-time Fourier transform on original audio or mixed audio to be coded;
and encoding the short-time Fourier transform result by using an audio encoder.
4. The dual congruence network based sound source localization and sound source separation method according to claim 1, wherein the sounding domain detection specifically comprises:
and performing two-dimensional average pooling on the coded mixed audio features, performing matrix conversion and activation function processing to obtain the probability of each sound domain, taking the two sound domains with the maximum probability as prediction results, and updating parameters by using a binary cross entropy loss function.
5. The dual congruence network based sound source localization and sound source separation method of claim 1, wherein the probability matrix of sounding objects is computed as follows: the pooled original audio features are multiplied with the corresponding coded frame image features, the multiplication results are summed over the feature dimension, and after activation function processing the probability matrix of sounding objects corresponding to all feature points in the frame image features is obtained.
6. A dual coherent network based sound source localization and sound source separation system for implementing the sound source localization and sound source separation method of claim 1; the system for positioning and separating the sound source comprises:
the data acquisition module is used for acquiring an audio and video data set, randomly selecting a pair of videos containing different sound domains from the data set, extracting original audio and frame images in the videos, and constructing mixed audio and spliced images according to each pair of videos;
an audio encoding module for encoding the original audio and the mixed audio;
the image coding module is used for coding the frame image and the spliced image;
the sounding domain detection module is used for carrying out sounding domain detection on the coded mixed audio features to obtain different sounding domain detection results contained in the mixed audio;
a sound source separation module: the system is used for separating the audios corresponding to different sound domains from the mixed audio according to the mixed audio, the characteristics of the coded spliced images and the detection results of the different sound domains;
the sound source positioning module: the system is used for positioning and obtaining a sounding object from a frame image according to an encoded original audio and the frame image;
and the multi-task training module is used for performing end-to-end multi-task training on the sounding domain detection module, the sound source separation module and the sound source positioning module, and keeping consistency before and after separation and consistency constraint before and after positioning in the training process.
CN202111441409.3A 2021-11-30 2021-11-30 Method and system for sound source positioning and sound source separation based on dual coherent network Active CN113850246B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111441409.3A CN113850246B (en) 2021-11-30 2021-11-30 Method and system for sound source positioning and sound source separation based on dual coherent network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111441409.3A CN113850246B (en) 2021-11-30 2021-11-30 Method and system for sound source positioning and sound source separation based on dual coherent network

Publications (2)

Publication Number Publication Date
CN113850246A CN113850246A (en) 2021-12-28
CN113850246B true CN113850246B (en) 2022-02-18

Family

ID=78982562

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111441409.3A Active CN113850246B (en) 2021-11-30 2021-11-30 Method and system for sound source positioning and sound source separation based on dual coherent network

Country Status (1)

Country Link
CN (1) CN113850246B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114596876B (en) * 2022-01-21 2023-04-07 中国科学院自动化研究所 Sound source separation method and device
CN115174959B (en) * 2022-06-21 2024-01-30 咪咕文化科技有限公司 Video 3D sound effect setting method and device
CN115862682B (en) * 2023-01-03 2023-06-20 杭州觅睿科技股份有限公司 Sound detection method and related equipment
CN117475360B (en) * 2023-12-27 2024-03-26 南京纳实医学科技有限公司 Biological feature extraction and analysis method based on audio and video characteristics of improved MLSTM-FCN

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110970056A (en) * 2019-11-18 2020-04-07 清华大学 Method for separating sound source from video
CN112712819A (en) * 2020-12-23 2021-04-27 电子科技大学 Visual auxiliary cross-modal audio signal separation method

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3671739A1 (en) * 2018-12-21 2020-06-24 FRAUNHOFER-GESELLSCHAFT zur Förderung der angewandten Forschung e.V. Apparatus and method for source separation using an estimation and control of sound quality
US20210272573A1 (en) * 2020-02-29 2021-09-02 Robert Bosch Gmbh System for end-to-end speech separation using squeeze and excitation dilated convolutional neural networks
CN113674768A (en) * 2021-04-02 2021-11-19 深圳市微纳感知计算技术有限公司 Call-for-help detection method, device, equipment and storage medium based on acoustics

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110970056A (en) * 2019-11-18 2020-04-07 清华大学 Method for separating sound source from video
CN112712819A (en) * 2020-12-23 2021-04-27 电子科技大学 Visual auxiliary cross-modal audio signal separation method

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
Monophonic singing voice separation based on deep learning; Yutian; 2019 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR); 2019-04-25; 491-495 *
Streaming End-to-End Multi-Talker Speech Recognition; Liang Lu et al.; IEEE Signal Processing Letters; 2021-04-02; 803-807 *
Research on on-screen and off-screen speech separation algorithms based on multimodal fusion; 杨宇; China Master's Theses Full-text Database (Information Science and Technology); 2021-09-15; I136-74 *
Research on noise source localization based on audio-visual information fusion; 赵义鹏 et al.; Chinese Journal of Scientific Instrument; 2018-02-28; Vol. 39, No. 2; 89-99 *
End-to-end sound source separation: status, progress and future; 书哲_深蓝学院; https://www.jianshu.com/p/f47e5bee9949; 2020-08-14; 1-13 *

Also Published As

Publication number Publication date
CN113850246A (en) 2021-12-28

Similar Documents

Publication Publication Date Title
CN113850246B (en) Method and system for sound source positioning and sound source separation based on dual coherent network
Morgado et al. Self-supervised generation of spatial audio for 360 video
CN112071329B (en) Multi-person voice separation method and device, electronic equipment and storage medium
US20200402497A1 (en) Systems and Methods for Speech Generation
CN111539449B (en) Sound source separation and positioning method based on second-order fusion attention network model
CN108962229B (en) Single-channel and unsupervised target speaker voice extraction method
Parekh et al. Motion informed audio source separation
Slizovskaia et al. Conditioned source separation for musical instrument performances
CN112071330B (en) Audio data processing method and device and computer readable storage medium
Fan et al. Singing voice separation and pitch extraction from monaural polyphonic audio music via DNN and adaptive pitch tracking
Lu et al. Self-supervised audio spatialization with correspondence classifier
CN111653270B (en) Voice processing method and device, computer readable storage medium and electronic equipment
Dong et al. Clipsep: Learning text-queried sound separation with noisy unlabeled videos
Montesinos et al. Solos: A dataset for audio-visual music analysis
Osako et al. Supervised monaural source separation based on autoencoders
Zhu et al. Leveraging category information for single-frame visual sound source separation
Lai et al. RPCA-DRNN technique for monaural singing voice separation
Feng et al. SSLNet: A network for cross-modal sound source localization in visual scenes
CN115033734B (en) Audio data processing method and device, computer equipment and storage medium
Qiu et al. Self-Supervised Learning Based Phone-Fortified Speech Enhancement.
Ullrich et al. Music transcription with convolutional sequence-to-sequence models
Reddy et al. Audioslots: A slot-centric generative model for audio separation
Kitahara et al. Instrogram: A new musical instrument recognition technique without using onset detection nor f0 estimation
Ngo et al. Sound context classification based on joint learning model and multi-spectrogram features
WO2023002737A1 (en) A method and system for scene-a ware audio-video representation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant