CN113850246B - Method and system for sound source positioning and sound source separation based on dual coherent network - Google Patents
- Publication number
- CN113850246B CN113850246B CN202111441409.3A CN202111441409A CN113850246B CN 113850246 B CN113850246 B CN 113850246B CN 202111441409 A CN202111441409 A CN 202111441409A CN 113850246 B CN113850246 B CN 113850246B
- Authority
- CN
- China
- Prior art keywords
- sound source
- audio
- sound
- network
- positioning
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T3/00—Geometric image transformation in the plane of the image
- G06T3/40—Scaling the whole image or part thereof
- G06T3/4038—Scaling the whole image or part thereof for image mosaicing, i.e. plane images composed of plane sub-images
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T9/00—Image coding
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0272—Voice signal separating
- G10L21/028—Voice signal separating using properties of sound source
Abstract
The invention discloses a method and a system for sound source localization and sound source separation based on a dual coherent network, belonging to the audio-visual multimodal field. The method mainly comprises the following steps: 1) obtain an audio-video data set, select a pair of videos belonging to different sound domains, extract the corresponding single-source audio and image information, and compute the mixed audio; 2) feature-encode the audio and the image respectively to obtain audio and image features; 3) feed the mixed audio and the image features together into the sound source separation module of the dual consistent network to separate the single-source audios; 4) feed the image and the corresponding audio features into the sound source localization module of the dual consistent network to obtain the sounding objects in the image. Compared with traditional methods for the sound source localization and sound source separation tasks, the method treats the two tasks as dual tasks, completes both with the same framework, and lets the two tasks mutually enhance each other's performance during training by exploiting their dual characteristics, finally improving the results on both tasks.
Description
Technical Field
The invention relates to the audio-visual multimodal field, and in particular to a method for sound source localization and sound source separation based on a dual coherent network.
Background
Vision and hearing are important ways for humans to perceive the world: humans can identify and separate the sounds emitted by various objects, and can find the sounding objects in a complex scene. This strong perception is the basis for subsequent complex decision-making. Giving machines the ability to separate and localize sound sources is therefore a necessary step toward realizing artificial intelligence.
Much current research focuses on two separate tasks, namely sound source localization and visually guided sound separation. Although these have achieved some success, unsolved problems remain:
1) Current visually guided sound separation models require a specific image to query the sound corresponding to an object in the image; but when multiple objects are present in the image, the model cannot know which object the sound corresponds to, and performance degrades.
2) At present, most models handle only one of the two tasks, and no single framework processes both; when audio must be localized and separated at the same time, the models are simply stacked on each other, making the system complex and slow.
Disclosure of Invention
The invention provides a self-supervised dual consistent network that exploits the characteristics of the sound source localization and sound source separation tasks simultaneously, realizes both tasks within the same framework, and achieves mutual enhancement.
In order to achieve the purpose, the technical scheme adopted by the invention is as follows:
one of the objectives of the present invention is to provide a method for sound source localization and sound source separation based on dual coherent network, comprising the following steps:
1) acquiring an audio and video data set, randomly selecting a pair of videos containing different sound domains from the data set, extracting original audio and frame images in the videos, and constructing mixed audio and spliced images according to each pair of videos;
2) respectively encoding the original audio and the frame image, and the mixed audio and the spliced image;
3) performing sounding domain detection on the coded mixed audio features to obtain the detection results of the different sounding domains contained in the mixed audio;
4) constructing a dual consistent network comprising a sound source separation network and a sound source positioning network, taking the characteristics of mixed audio and coded spliced images as the input of the sound source separation network, separating audio corresponding to different sound domains from the mixed audio according to the detection results of the different sound domains, and calculating the separation loss;
the coded original audio and the frame image are used as the input of a sound source positioning network, a sound object is positioned from the frame image, and the matching loss is calculated;
5) performing end-to-end multi-task training on the dual coherent network, and keeping consistency before and after separation and consistency constraint before and after positioning in the training process; and realizing sound source positioning and sound source separation by using the trained dual consistent network.
Another object of the present invention is to provide a sound source localization and separation system for implementing the above method, comprising:
the data acquisition module is used for acquiring an audio and video data set, randomly selecting a pair of videos containing different sound domains from the data set, extracting original audio and frame images in the videos, and constructing mixed audio and spliced images according to each pair of videos;
an audio encoding module for encoding the original audio and the mixed audio;
the image coding module is used for coding the frame image and the spliced image;
the sounding domain detection module is used for carrying out sounding domain detection on the coded mixed audio features to obtain different sounding domain detection results contained in the mixed audio;
a sound source separation module, for separating the audios corresponding to the different sound domains from the mixed audio according to the mixed audio, the encoded stitched image features and the detection results of the different sound domains;
a sound source positioning module, for localizing the sounding object from the frame image according to the encoded original audio and the frame image;
and a multi-task training module, for performing end-to-end multi-task training on the sounding domain detection module, the sound source separation module and the sound source positioning module, maintaining the consistency constraints before and after separation and before and after positioning during training.
Compared with the prior art, the invention has the following beneficial effects.
(1) The invention regards sound source positioning and sound source separation as dual tasks, so the two tasks are solved with the same simple framework and a better effect is obtained. Traditional schemes basically each solve a single task; their models are complex and cannot simply be stacked together.
(2) The invention designs the dual consistent network by utilizing the characteristic of the dual task of sound source positioning and sound source separation, and can respectively enhance the positioning and separating performances by utilizing the separated audio and the positioned object, thereby achieving the effect of dual consistency and mutual promotion of the two tasks and obtaining better effect on the two tasks.
(3) In the sound source separation module, the invention designs a method based on sounding domain separation: when separating audio, the separation results of all sounding domains are predicted, whereas traditional methods predict the separation result for a given image query. This solves the problem that, when multiple objects exist in the image, the model cannot know which object the separated sound corresponds to, which degrades performance.
Drawings
Fig. 1 is a schematic diagram illustrating a method for sound source localization and sound source separation based on dual coherent networks according to an embodiment of the present invention.
Detailed Description
The invention will be further elucidated and described with reference to the drawings and the detailed description.
As shown in fig. 1, the method for sound source localization and sound source separation based on dual coherent network of the present invention mainly comprises the following steps.
Step one, acquiring an audio and video data set, randomly selecting a pair of videos containing different sound domains from the data set, extracting the original audio and frame images in the videos, and constructing the mixed audio and stitched images from each pair of videos;
step two, respectively encoding the original audio and the frame image, and the mixed audio and the spliced image;
step three, performing sounding domain detection on the coded mixed audio features to obtain the detection results of the different sounding domains contained in the mixed audio;
step four, constructing a dual consistent network comprising a sound source separation network and a sound source positioning network, taking the characteristics of the mixed audio and the coded spliced image as the input of the sound source separation network, separating the audio corresponding to different sound domains from the mixed audio according to the detection results of the different sound domains, and calculating the separation loss;
the coded original audio and the frame image are used as the input of a sound source positioning network, a sound object is positioned from the frame image, and the matching loss is calculated;
step five, performing end-to-end multi-task training on the dual consistent network, maintaining the consistency constraints before and after separation and before and after positioning during training; and realizing sound source positioning and sound source separation with the trained dual consistent network.
Step one is used for constructing a training set.
In this embodiment, a pair of videos Video_1 and Video_2 containing different sound domains (musical instruments) is randomly selected, and the corresponding audio A_1, A_2 and a certain frame image V_1, V_2 are randomly extracted.
The mixed audio is obtained by superimposing equal-length audio clips randomly extracted from the pair of videos in the time dimension; in this embodiment, the two audio segments form the mixed audio A_12 = A_1 + A_2. The stitched image is obtained by resizing the two frame images corresponding to the two audios and splicing them in the horizontal direction: V_12 = [V_1, V_2].
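The construction of the mixed audio and stitched image can be sketched as follows (a minimal NumPy illustration, not the patented implementation; the clip length and frame size used here are assumptions):

```python
import numpy as np

def build_training_pair(a1, a2, v1, v2):
    """Mix two equal-length single-source audio clips and stitch their frames.

    a1, a2 : 1-D waveforms of equal length, one per sound domain
    v1, v2 : H x W x 3 frame images already resized to a common size
    """
    assert a1.shape == a2.shape, "audio clips must have equal length"
    a12 = a1 + a2                            # mixed audio A12 = A1 + A2
    v12 = np.concatenate([v1, v2], axis=1)   # horizontal stitch V12 = [V1, V2]
    return a12, v12

# toy example: 1-second clips at 11025 Hz, 224x224 frames (sizes assumed)
a1 = np.random.default_rng(0).normal(size=11025).astype(np.float32)
a2 = np.random.default_rng(1).normal(size=11025).astype(np.float32)
v1 = np.zeros((224, 224, 3), dtype=np.uint8)
v2 = np.ones((224, 224, 3), dtype=np.uint8)
a12, v12 = build_training_pair(a1, a2, v1, v2)
```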
And step two is used for coding the audio and the image.
In this embodiment, the encoding method for the original audio and the mixed audio is: firstly, carrying out short-time Fourier transform on original audio or mixed audio to be coded; and then, encoding the short-time Fourier transform result by using an audio encoder. The audio encoder can be realized by adopting the existing network such as ResNet.
The method for coding the original frame image and the spliced image comprises the following steps: the image is processed directly with an image encoder.
And step three is used for sounding domain detection.
In this embodiment, the encoded mixed audio features are subjected to two-dimensional average pooling, then subjected to matrix conversion and activation function processing to obtain probabilities in each sound domain, and the two sound domains with the highest probabilities are used as prediction results, and parameters are updated by using a binary cross entropy loss function.
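This detection head can be sketched as follows (a NumPy illustration under assumptions: the feature shape, the matrix `W` and bias `b`, and the domain count `C = 11` are illustrative; in the real network `W` and `b` are learned parameters):

```python
import numpy as np

def detect_sound_domains(fs12, W, b, k=2):
    """Predict which sound domains are present in the mixed audio.

    fs12 : encoded mixed-audio features, shape (d, T, F)
    W    : transformation matrix, shape (C, d);  b : bias vector, shape (C,)
    Returns per-domain probabilities and the indices of the top-k domains.
    """
    pooled = fs12.mean(axis=(1, 2))         # two-dimensional average pooling -> (d,)
    logits = W @ pooled + b                 # matrix conversion to C sound domains
    probs = 1.0 / (1.0 + np.exp(-logits))   # sigmoid -> probabilities in (0, 1)
    topk = np.argsort(probs)[-k:][::-1]     # the k domains with highest probability
    return probs, topk

rng = np.random.default_rng(0)
fs12 = rng.normal(size=(8, 4, 4))                 # toy encoded features, d = 8
W = rng.normal(size=(11, 8)); b = np.zeros(11)    # C = 11 instrument domains
probs, topk = detect_sound_domains(fs12, W, b)
```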
And step four, separating and positioning functions of the sound source separating network and the sound source positioning network are executed.
A. The sound source separation network specifically comprises:
carrying out short-time Fourier transform on the mixed audio to obtain a frequency spectrum, and utilizing a segmentation network to segment the frequency spectrum of the mixed audio;
performing two-dimensional average pooling on the coded spliced image features, interacting a pooling result with an audio segmentation result, and performing matrix conversion and activation function processing to obtain a predicted spectrum mask;
and multiplying the predicted spectrum mask with the spectrum of the mixed audio, extracting the spectrum of the predicted sound domain according to the prediction result of the sound domain, and obtaining the audio which is separated from the mixed audio and corresponds to different sound domains through inverse short-time Fourier transform.
B. The sound source positioning network specifically comprises:
Firstly, maximum pooling is performed on the encoded original audio features; positioning is then performed using the max-pooled result together with the encoded frame image features: the probability that each feature point in the frame image features corresponds to the sounding object is calculated, and the original frame image region corresponding to the connected region of all feature points whose probability is greater than a threshold is taken as the positioning result, realizing the positioning of the sounding object from the frame image.
In order to train the sound source separation network, the invention marks the real spectrum mask in the spectrum of the mixed audio according to the real sounding domains, and updates the parameters with a binary cross-entropy loss between the predicted spectrum mask and the real spectrum mask.
In addition, the consistency before and after separation needs to be ensured in the training process of the sound source separation network, and the loss of the separation consistency is calculated:
loss_A = mean(| sum_C(mask_pred * S_12) - S_12 |)
in the formula, loss_A represents the separation consistency loss, mean(·) represents the mean, sum_C(·) represents the summation over the sound domain dimension, mask_pred represents the predicted spectrum mask, S_12 represents the spectrum obtained after short-time Fourier transform of the mixed audio, and |·| represents the L1 norm.
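A sketch of this separation consistency loss (NumPy; the spectrum and mask shapes are illustrative):

```python
import numpy as np

def separation_consistency_loss(mask_pred, S12):
    """loss_A = mean(| sum_C(mask_pred * S12) - S12 |).

    mask_pred : predicted spectrum masks, shape (C, F, T), one per sound domain
    S12       : magnitude spectrum of the mixed audio, shape (F, T)
    """
    recombined = (mask_pred * S12).sum(axis=0)       # sum over the domain dimension C
    return float(np.mean(np.abs(recombined - S12)))  # L1 distance, then mean

# masks that exactly partition the mixture reconstruct it, giving (near-)zero loss
S12 = np.random.default_rng(2).random((512, 44))
masks = np.stack([np.full_like(S12, 0.3), np.full_like(S12, 0.7)])
```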
In order to train a sound source positioning network, the invention matches the original audio features after the maximum pooling with the frame image features after the encoding, and calculates the matching loss:
in the formula, loss_M represents the matching loss, mean(·) represents the mean, and sum(·) represents the vector sum; the audio operand is the original audio feature after the i-th pooling, i ∈ [1,2], and the image operand is the i-th encoded frame image feature;
in addition, the consistency before and after positioning is also ensured in the training process of the sound source positioning network, and the positioning consistency loss is calculated:
in the formula, loss_V represents the positioning consistency loss, |·| represents the L1 norm, P_1 represents the probability matrix of sounding objects over all feature points in the first frame image features, P_2 represents the corresponding probability matrix for the second frame image features, and P_12 represents the probability matrix of sounding objects over all feature points in the stitched image features.
In this embodiment, the probability matrix of sounding objects is computed as follows: the pooled original audio features are multiplied with the corresponding encoded frame image features, the products are summed over the feature dimension, and after activation-function processing a probability matrix is obtained giving, for every feature point in the frame image features, the probability of corresponding to the sounding object.
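This computation can be sketched as follows (NumPy; the feature dimensions are illustrative, and the sigmoid stands in for the activation function):

```python
import numpy as np

def sounding_probability_map(audio_feat, image_feat):
    """Probability that each feature point in the frame image features sounds.

    audio_feat : pooled original audio features, shape (d,)
    image_feat : encoded frame image features, shape (d, H, W)
    """
    scores = (audio_feat[:, None, None] * image_feat).sum(axis=0)  # sum over feature dim d
    return 1.0 / (1.0 + np.exp(-scores))                           # sigmoid activation

rng = np.random.default_rng(3)
P = sounding_probability_map(rng.normal(size=16), rng.normal(size=(16, 7, 7)))
region = P > 0.5  # feature points above the threshold form the sounding-object mask
```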
In one embodiment of the present invention, a training process for a dual coherent network based sound source localization and sound source separation method is described in detail. The method comprises the following specific steps.
1. A training data set is constructed.
Firstly, an audio and video data set is obtained, a pair of videos Video_1 and Video_2 containing different sound domains (musical instruments) is randomly selected, and approximately 6 seconds of audio A_1, A_2 are randomly extracted at a sampling rate of 11025 Hz, together with a resized frame image V_1, V_2; the mixed audio A_12 = A_1 + A_2 and the horizontally stitched image V_12 = [V_1, V_2] are constructed at the same time.
2. And (5) feature coding.
For the original audio A_1, A_2 and the mixed audio A_12 obtained in step 1, a short-time Fourier transform (STFT) with a Hann window of size 1022 and a hop length of 256 is first performed, that is:
S_i = STFT(A_i)
When A_i = A_1, A_2 or A_12, the corresponding spectra S_1, S_2, S_12 are obtained respectively; then, feature encoding is carried out with an audio ResNet model, namely:
FS_i = ResNet18_audio(S_i)
For the original frame images V_1, V_2 obtained in step 1 and the stitched image V_12, feature encoding is performed with an image ResNet model pre-trained on ImageNet, namely:
FV_i = ResNet18_image(V_i)
When V_i = V_1, V_2 or V_12, the corresponding encoded image features FV_1, FV_2, FV_12 are obtained respectively, where d is the dimension of the feature vector at each spatial position.
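The shape of the spectrum under the stated STFT parameters can be checked with a short sketch (using `scipy.signal.stft` as a stand-in for the embodiment's STFT; the ResNet encoders themselves are omitted):

```python
import numpy as np
from scipy.signal import stft

# roughly 6 seconds of toy audio at the embodiment's 11025 Hz sampling rate
fs = 11025
a_i = np.random.default_rng(4).normal(size=6 * fs)

# Hann window of size 1022 and hop length 256, i.e. noverlap = 1022 - 256
_, _, S_i = stft(a_i, fs=fs, window="hann", nperseg=1022, noverlap=1022 - 256)

freq_bins, frames = S_i.shape  # 1022 // 2 + 1 = 512 frequency bins
```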
3. And separating sound sources.
3.1 Sounding domain detection:
First, C sound domains (different instruments) are set for the data used in the invention. For the encoded mixed audio features FS_12 obtained in step 2, the following transformation is used:
logit_field = sigmoid(W_f · AvgPool2D(FS_12) + b_f)
giving probabilities over the respective sound domains, where · represents matrix multiplication, AvgPool2D represents two-dimensional average pooling, W_f represents a learnable transformation matrix, b_f is the bias vector, and sigmoid(·) scales the result to the (0,1) interval. During training, the model parameters can be updated with a binary cross-entropy loss in which the target is 1 for a domain that actually sounds and 0 otherwise. At inference, the 2 entries of logit_field with the highest probabilities are taken out directly, corresponding to domains a and b (ideally the domains in which A_1 and A_2 lie).
3.2 For the mixed audio spectrum S_12 obtained in step 2, an audio segmentation result is obtained through the classical segmentation network Unet; then the encoded stitched image features FV_12 obtained in step 2 interact with it after the following transformation:
mask_pred = sigmoid(Unet(S_12) ⊙ (W_v · AvgPool2D(FV_12) + b_v))
deriving a predicted mask over the spectrum, where ⊙ represents element-wise multiplication, W_v represents a learnable transformation matrix, and b_v is a bias vector.
3.3 During training, the network parameters are updated with a binary cross-entropy loss between the predicted mask mask_pred and the real mask mask_gt, where mask_gt represents the spectrum mask over the real sounding domains.
3.4 The mask is then multiplied with the original mixed spectrum S_12, from which the spectrum of each sound domain can be obtained; according to the sounding domains a and b obtained in step 3.1, the spectra of the corresponding domains are taken out, and the correspondingly separated audio is obtained through the inverse short-time Fourier transform (ISTFT).
3.5 During training, the consistency before and after separation needs to be ensured, and the following loss is applied:
loss_A = mean(| sum_C(mask_pred * S_12) - S_12 |)
where |·| represents the L1 norm, sum_C represents the summation in the domain dimension, and mean represents the average over the entire vector.
4. And (6) positioning a sound source.
4.1 For the image features FV_1, FV_2 and the audio features FS_1, FS_2 obtained in step 2, maximum pooling is first applied to the audio features; the matching loss loss_M is then designed over the pooled audio features and the image features, where sum represents the summation over the entire vector and mean represents the averaging over the entire vector.
4.2 During positioning, the probability matrix of the sounding object over all feature points in the frame image features is calculated:
P_i = sigmoid(sum_d(MaxPool(FS_i) ⊙ FV_i))
where sum_d represents the summation in the feature dimension and i ∈ [1,2]; the region of P_i greater than the threshold is the region where the sounding object is located. In particular, the sounding object O_1 in V_1 is obtained; in the same way, the sounding object O_2 in V_2 is obtained.
4.3 Finally, the image (positioning) consistency loss is constructed as follows:
loss_V = mean(| [P_1, P_2] - P_12 |)
where mean represents the average over the entire vector, [P_1, P_2] denotes the horizontal concatenation of the two single-image probability matrices, and the calculation of P_1, P_2 and P_12 refers to the formula in 4.2.
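A sketch of this positioning consistency loss (NumPy; comparing the stitched-image probability map against the horizontal concatenation of the single-image maps is an assumption based on how the stitched image is built):

```python
import numpy as np

def localization_consistency_loss(P1, P2, P12):
    """loss_V sketch: mean L1 distance between the stitched-image probability
    map and the horizontal concatenation of the two single-image maps."""
    return float(np.mean(np.abs(np.concatenate([P1, P2], axis=1) - P12)))

P1 = np.full((7, 7), 0.8)               # probability map of the first frame image
P2 = np.full((7, 7), 0.2)               # probability map of the second frame image
P12 = np.concatenate([P1, P2], axis=1)  # a perfectly consistent stitched-image map
```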
5. During training, end-to-end multi-task training is performed on the dual consistent network by combining the above loss functions.
Compared with the traditional method in the tasks of sound source positioning and sound source separation, the method provided by the invention treats the two tasks as dual tasks, simultaneously completes the dual tasks by using the same framework, and mutually enhances the performance in the training process by utilizing the characteristics of the two tasks, thereby finally improving the effect on the two tasks.
Corresponding to the foregoing embodiments of a method for dual coherent network-based sound source localization and sound source separation, the present application further provides a system for dual coherent network-based sound source localization and sound source separation, which includes:
the data acquisition module is used for acquiring an audio and video data set, randomly selecting a pair of videos containing different sound domains from the data set, extracting original audio and frame images in the videos, and constructing mixed audio and spliced images according to each pair of videos;
an audio encoding module for encoding the original audio and the mixed audio;
the image coding module is used for coding the frame image and the spliced image;
the sounding domain detection module is used for carrying out sounding domain detection on the coded mixed audio features to obtain different sounding domain detection results contained in the mixed audio;
a sound source separation module, for separating the audios corresponding to the different sound domains from the mixed audio according to the mixed audio, the encoded stitched image features and the detection results of the different sound domains;
a sound source positioning module, for localizing the sounding object from the frame image according to the encoded original audio and the frame image;
and a multi-task training module, for performing end-to-end multi-task training on the sounding domain detection module, the sound source separation module and the sound source positioning module, maintaining the consistency constraints before and after separation and before and after positioning during training.
For the system embodiment, since it basically corresponds to the method embodiment, the relevant points may refer to the partial description of the method embodiment. The system embodiments described above are merely illustrative; modules such as the sound source separation module may or may not be physically separate. In addition, each functional module of the invention may be integrated into one processing unit, each module may exist alone physically, or two or more modules may be integrated into one unit. The integrated modules or units can be implemented in hardware or as software functional units, and some or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of the application.
The method is applied to the following embodiments to achieve the technical effects of the present invention, and detailed steps in the embodiments are not described again.
Examples
To further demonstrate its effectiveness, the invention performed experimental validation on the MUSIC data set, which contains 685 untrimmed videos collected from YouTube, of which 536 are solo and 149 are duet videos. The videos cover 11 instrument categories, including accordion, acoustic guitar, cello, clarinet, erhu, flute, trumpet, saxophone, violin and xylophone, which makes the data set suitable for the sound source separation and sound source localization tasks. For the sound source localization task, the intersection over union (IoU) and the area under the curve (AUC) were used as evaluation indexes. The visual localization methods SoP (Hang Zhao, Chuang Gan, Andrew Rouditchenko, Carl Vondrick, Josh H. McDermott, and Antonio Torralba. The sound of pixels. In ECCV, 2018) and DMC (Di Hu, Feiping Nie, and Xuelong Li. Deep multimodal clustering for unsupervised audiovisual learning. In CVPR, 2019) are used as comparisons.
TABLE 1 Sound Source localization Experimental results
For the sound source separation task, the experiment takes the signal-to-distortion ratio (SDR), the signal-to-interference ratio (SIR) and the signal-to-artifact ratio (SAR) as evaluation indexes. SoP (Hang Zhao, Chuang Gan, Andrew Rouditchenko, Carl Vondrick, Josh H. McDermott, and Antonio Torralba. The sound of pixels. In ECCV, 2018) is used as a comparison.
TABLE 2 Experimental results of Sound Source separation
Tables 1 and 2 show the evaluation results of the invention. The results are superior to those of the other models, indicating that the dual consistent network-based method is effective: the framework not only completes the two tasks of sound source localization and sound source separation simultaneously, but also exploits the dual characteristics of the two tasks so that, through the dual consistency losses, the two tasks mutually enhance each other's performance during training.
The foregoing lists merely illustrate specific embodiments of the invention. It is obvious that the invention is not limited to the above embodiments, but that many variations are possible. All modifications which can be derived or suggested by a person skilled in the art from the disclosure of the present invention are to be considered within the scope of the invention.
Claims (6)
1. A method for sound source localization and sound source separation based on dual coherent network is characterized by comprising the following steps:
1) acquiring an audio and video data set, randomly selecting a pair of videos containing different sound domains from the data set, extracting original audio and frame images in the videos, and constructing mixed audio and spliced images according to each pair of videos;
2) respectively encoding the original audio and the frame image, and the mixed audio and the spliced image;
3) performing sounding domain detection on the encoded mixed audio features to obtain the detection results of the different sounding domains contained in the mixed audio;
4) constructing a dual consistent network comprising a sound source separation network and a sound source positioning network, taking the characteristics of mixed audio and coded spliced images as the input of the sound source separation network, separating audio corresponding to different sound domains from the mixed audio according to the detection results of the different sound domains, and calculating the separation loss;
the coded original audio and the frame image are used as the input of a sound source positioning network, a sound object is positioned from the frame image, and the matching loss is calculated;
the sound source separation network specifically comprises:
carrying out short-time Fourier transform on the mixed audio to obtain a frequency spectrum, and utilizing a segmentation network to segment the frequency spectrum of the mixed audio;
performing two-dimensional average pooling on the coded spliced image features, interacting a pooling result with an audio segmentation result, and performing matrix conversion and activation function processing to obtain a predicted spectrum mask;
multiplying the predicted spectrum mask with the spectrum of the mixed audio, extracting the spectrum of the predicted sound domain according to the prediction result of the sound domain, and obtaining the audio corresponding to different sound domains separated from the mixed audio through inverse short-time Fourier transform;
the sound source positioning network first performs maximum pooling on the encoded original audio features, performs positioning using the max-pooled result together with the encoded frame image features, calculates the probability that each feature point in the frame image features corresponds to the sounding object, and takes the original frame image region corresponding to the connected region of all feature points whose probability is greater than a threshold as the positioning result, realizing the positioning of the sounding object from the frame image;
5) performing end-to-end multi-task training on the dual coherent network, and keeping consistency before and after separation and consistency constraint before and after positioning in the training process; realizing sound source positioning and sound source separation by using the trained dual consistent network;
the training process of the sound source separation network comprises the following steps:
marking a real spectrum mask in the spectrum of the mixed audio according to a real sound production domain, and calculating binary cross entropy loss function update parameters of a prediction spectrum mask and the real spectrum mask;
the consistency before and after separation needs to be ensured in the training process of the sound source separation network, and the loss of the separation consistency is calculated as follows:
loss_A = mean(|sum_C(mask_pred * S_12) - S_12|)
where loss_A denotes the separation consistency loss, mean(·) the mean, sum_C(·) the sum over the domain dimension, mask_pred the predicted spectrum mask, S_12 the spectrum obtained by short-time Fourier transform of the mixed audio, and |·| the L1 norm;
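A minimal numpy rendering of this loss; the spectrum size and the random stand-ins for S_12 and the masks are assumptions (a real S_12 would come from the STFT of the mixed audio):

```python
import numpy as np

rng = np.random.default_rng(2)
F, T = 257, 64
S12 = rng.standard_normal((F, T))            # spectrum of the mixed audio
mask_pred = rng.uniform(size=(2, F, T))      # one predicted mask per sounding domain

# loss_A = mean(|sum_C(mask_pred * S12) - S12|): the masked spectra,
# summed over the domain dimension C, should reconstruct the mixture.
loss_A = np.mean(np.abs((mask_pred * S12).sum(axis=0) - S12))

# Sanity check: masks that are exactly complementary give zero loss.
complementary = np.stack([mask_pred[0], 1.0 - mask_pred[0]])
zero_loss = np.mean(np.abs((complementary * S12).sum(axis=0) - S12))
```

When the per-domain masks sum to one at every time-frequency bin, the masked spectra add back up to the mixture and the loss vanishes, which is exactly the before/after-separation consistency the claim describes.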
the training process of the sound source localization network comprises the following steps:
matching the max-pooled original audio features with the encoded frame image features, and computing the matching loss:
in the formula, loss_M denotes the matching loss, mean(·) the mean, and sum(·) the sum of the vectors; the remaining symbols denote, respectively, the original audio feature after the ith pooling, i ∈ [1, 2], and the ith encoded frame image feature;
consistency before and after localization must be maintained during training of the sound source localization network; the localization consistency loss is computed as:
in the formula, loss_V denotes the localization consistency loss and |·| the L1 norm; the remaining symbols denote, respectively, the probability matrices of sounding objects over all feature points of the first frame image features, the second frame image features, and the spliced-image features.
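The exact formula did not survive extraction, but since the spliced image is the horizontal concatenation of the two frames, one L1 rendering consistent with the description is the following sketch; all tensors are random stand-ins, and taking the side-by-side concatenation of the per-frame matrices as the reference is an assumption:

```python
import numpy as np

rng = np.random.default_rng(3)
P1 = rng.uniform(size=(14, 14))       # sounding-object probabilities, frame 1
P2 = rng.uniform(size=(14, 14))       # sounding-object probabilities, frame 2
P_cat = rng.uniform(size=(14, 28))    # probabilities from the spliced image

# Localization on the spliced image should agree with localization on
# each frame separately (mean absolute / L1 difference).
loss_V = np.mean(np.abs(np.concatenate([P1, P2], axis=1) - P_cat))
```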
2. The sound source localization and sound source separation method based on a dual consistent network according to claim 1, wherein the mixed audio is obtained by splicing, along the time dimension, randomly extracted audio segments of equal length from a pair of videos; and the spliced image is obtained by resizing the frame images corresponding to the two audio segments and splicing them along the horizontal direction.
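The data construction of claim 2 can be sketched as follows; the clip lengths, frame sizes, common resize target, and the dependency-free nearest-neighbour resize are assumptions (any standard resize would do):

```python
import numpy as np

rng = np.random.default_rng(4)
audio_a = rng.standard_normal(16000)          # equal-length clip from video A
audio_b = rng.standard_normal(16000)          # equal-length clip from video B
frame_a = rng.uniform(size=(240, 320, 3))     # frame image for clip A
frame_b = rng.uniform(size=(120, 160, 3))     # frame image for clip B

# Mixed audio: splice the two clips along the time dimension.
mixed_audio = np.concatenate([audio_a, audio_b])

def resize_nn(img, h, w):
    """Nearest-neighbour resize to (h, w)."""
    ys = np.arange(h) * img.shape[0] // h
    xs = np.arange(w) * img.shape[1] // w
    return img[ys][:, xs]

# Spliced image: resize both frames to a common size, then splice
# them along the horizontal direction.
spliced = np.concatenate([resize_nn(frame_a, 224, 224),
                          resize_nn(frame_b, 224, 224)], axis=1)
```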
3. The sound source localization and sound source separation method based on a dual consistent network according to claim 1, wherein in step 2) the original audio and the mixed audio are encoded by:
carrying out short-time Fourier transform on original audio or mixed audio to be coded;
and encoding the short-time Fourier transform result by using an audio encoder.
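The two encoding steps above, up to the encoder itself, can be sketched as follows; the sample rate, FFT parameters, and the log-magnitude input convention are assumptions:

```python
import numpy as np
from scipy.signal import stft

rng = np.random.default_rng(5)
audio = rng.standard_normal(32000)     # original or mixed audio to be encoded

# Step 1: short-time Fourier transform.
_, _, Z = stft(audio, fs=16000, nperseg=1022, noverlap=766)

# Step 2: the audio encoder consumes the transform result; here we only
# form a plausible (log-magnitude) input, shape (freq_bins, time_frames).
spectrogram = np.log1p(np.abs(Z))
```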
4. The sound source localization and sound source separation method based on a dual consistent network according to claim 1, wherein the sounding domain detection specifically comprises:
performing two-dimensional average pooling on the encoded mixed audio features, applying matrix conversion and an activation function to obtain the probability of each sounding domain, taking the two sounding domains with the highest probabilities as the prediction result, and updating parameters with a binary cross-entropy loss function.
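A numpy sketch of this forward pass; the number of sounding-domain categories, feature shapes, random weights, and toy labels are assumptions:

```python
import numpy as np

rng = np.random.default_rng(6)
n_domains = 15                                    # candidate sounding domains (assumed)
feat = rng.standard_normal((512, 16, 16))         # encoded mixed-audio features (C, H, W)

pooled = feat.mean(axis=(1, 2))                   # two-dimensional average pooling
W = 0.01 * rng.standard_normal((n_domains, 512))  # the matrix conversion (a linear layer)
probs = 1.0 / (1.0 + np.exp(-(W @ pooled)))       # sigmoid activation

# The two sounding domains with the highest probability are the prediction.
pred = np.argsort(probs)[-2:]

# Binary cross-entropy against multi-hot ground-truth domain labels.
target = np.zeros(n_domains)
target[[3, 7]] = 1.0                              # toy ground truth
eps = 1e-7
bce = -np.mean(target * np.log(probs + eps)
               + (1.0 - target) * np.log(1.0 - probs + eps))
```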
5. The sound source localization and sound source separation method based on a dual consistent network according to claim 1, wherein the probability matrix of sounding objects is computed as follows: multiply the pooled original audio features with the corresponding encoded frame image features, sum the products over the feature dimension, and apply an activation function to obtain a probability matrix giving, for every feature point in the frame image features, the probability of a sounding object.
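This audio-image interaction can be sketched in a few lines; the feature dimensionality and the sigmoid as the activation are assumptions:

```python
import numpy as np

rng = np.random.default_rng(7)
audio_feat = rng.standard_normal(512)            # pooled original-audio feature
img_feat = rng.standard_normal((512, 14, 14))    # encoded frame image features (C, H, W)

# Multiply the pooled audio feature with the image features, sum over
# the feature (channel) dimension, then apply the activation: one
# sounding-object probability per feature point.
scores = (audio_feat[:, None, None] * img_feat).sum(axis=0)
prob_matrix = 1.0 / (1.0 + np.exp(-scores))
```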
6. A sound source localization and sound source separation system based on a dual consistent network, for implementing the sound source localization and sound source separation method of claim 1, the system comprising:
a data acquisition module for acquiring an audio-video data set, randomly selecting from it pairs of videos containing different sounding domains, extracting the original audio and frame images from the videos, and constructing a mixed audio and a spliced image for each pair of videos;
an audio encoding module for encoding the original audio and the mixed audio;
an image encoding module for encoding the frame images and the spliced image;
a sounding domain detection module for performing sounding domain detection on the encoded mixed audio features to obtain the different sounding domains contained in the mixed audio;
a sound source separation module for separating the audio of the different sounding domains from the mixed audio according to the mixed audio, the encoded spliced-image features, and the sounding domain detection results;
a sound source localization module for localizing the sounding object in the frame image according to the encoded original audio and frame image; and
a multi-task training module for performing end-to-end multi-task training on the sounding domain detection module, the sound source separation module, and the sound source localization module, maintaining the consistency-before-and-after-separation and consistency-before-and-after-localization constraints during training.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111441409.3A CN113850246B (en) | 2021-11-30 | 2021-11-30 | Method and system for sound source positioning and sound source separation based on dual coherent network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113850246A CN113850246A (en) | 2021-12-28 |
CN113850246B true CN113850246B (en) | 2022-02-18 |
Family
ID=78982562
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111441409.3A Active CN113850246B (en) | 2021-11-30 | 2021-11-30 | Method and system for sound source positioning and sound source separation based on dual coherent network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113850246B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114596876B (en) * | 2022-01-21 | 2023-04-07 | 中国科学院自动化研究所 | Sound source separation method and device |
CN115174959B (en) * | 2022-06-21 | 2024-01-30 | 咪咕文化科技有限公司 | Video 3D sound effect setting method and device |
CN115862682B (en) * | 2023-01-03 | 2023-06-20 | 杭州觅睿科技股份有限公司 | Sound detection method and related equipment |
CN117475360B (en) * | 2023-12-27 | 2024-03-26 | 南京纳实医学科技有限公司 | Biological feature extraction and analysis method based on audio and video characteristics of improved MLSTM-FCN |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110970056A (en) * | 2019-11-18 | 2020-04-07 | 清华大学 | Method for separating sound source from video |
CN112712819A (en) * | 2020-12-23 | 2021-04-27 | 电子科技大学 | Visual auxiliary cross-modal audio signal separation method |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP3671739A1 (en) * | 2018-12-21 | 2020-06-24 | FRAUNHOFER-GESELLSCHAFT zur Förderung der angewandten Forschung e.V. | Apparatus and method for source separation using an estimation and control of sound quality |
US20210272573A1 (en) * | 2020-02-29 | 2021-09-02 | Robert Bosch Gmbh | System for end-to-end speech separation using squeeze and excitation dilated convolutional neural networks |
CN113674768A (en) * | 2021-04-02 | 2021-11-19 | 深圳市微纳感知计算技术有限公司 | Call-for-help detection method, device, equipment and storage medium based on acoustics |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110970056A (en) * | 2019-11-18 | 2020-04-07 | 清华大学 | Method for separating sound source from video |
CN112712819A (en) * | 2020-12-23 | 2021-04-27 | 电子科技大学 | Visual auxiliary cross-modal audio signal separation method |
Non-Patent Citations (5)
Title |
---|
Monophonic singing voice separation based on deep learning; Yutian; 2019 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR); 2019-04-25; 491-495 *
Streaming End-to-End Multi-Talker Speech Recognition; Liang Lu et al.; IEEE Signal Processing Letters; 2021-04-02; 803-807 *
Research on on-screen and off-screen speech separation algorithms based on multimodal fusion; Yang Yu; China Masters' Theses Full-text Database (Information Science and Technology); 2021-09-15; I136-74 *
Noise source localization based on audio-visual information fusion; Zhao Yipeng et al.; Chinese Journal of Scientific Instrument; 2018-02; Vol. 39, No. 2; 89-99 *
End-to-end sound source separation: status, progress and future; Shuzhe (DeepBlue Academy); https://www.jianshu.com/p/f47e5bee9949; 2020-08-14; 1-13 *
Also Published As
Publication number | Publication date |
---|---|
CN113850246A (en) | 2021-12-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113850246B (en) | Method and system for sound source positioning and sound source separation based on dual coherent network | |
Morgado et al. | Self-supervised generation of spatial audio for 360 video | |
CN112071329B (en) | Multi-person voice separation method and device, electronic equipment and storage medium | |
US20200402497A1 (en) | Systems and Methods for Speech Generation | |
CN111539449B (en) | Sound source separation and positioning method based on second-order fusion attention network model | |
CN108962229B (en) | Single-channel and unsupervised target speaker voice extraction method | |
Parekh et al. | Motion informed audio source separation | |
Slizovskaia et al. | Conditioned source separation for musical instrument performances | |
CN112071330B (en) | Audio data processing method and device and computer readable storage medium | |
Fan et al. | Singing voice separation and pitch extraction from monaural polyphonic audio music via DNN and adaptive pitch tracking | |
Lu et al. | Self-supervised audio spatialization with correspondence classifier | |
CN111653270B (en) | Voice processing method and device, computer readable storage medium and electronic equipment | |
Dong et al. | Clipsep: Learning text-queried sound separation with noisy unlabeled videos | |
Montesinos et al. | Solos: A dataset for audio-visual music analysis | |
Osako et al. | Supervised monaural source separation based on autoencoders | |
Zhu et al. | Leveraging category information for single-frame visual sound source separation | |
Lai et al. | RPCA-DRNN technique for monaural singing voice separation | |
Feng et al. | SSLNet: A network for cross-modal sound source localization in visual scenes | |
CN115033734B (en) | Audio data processing method and device, computer equipment and storage medium | |
Qiu et al. | Self-Supervised Learning Based Phone-Fortified Speech Enhancement. | |
Ullrich et al. | Music transcription with convolutional sequence-to-sequence models | |
Reddy et al. | Audioslots: A slot-centric generative model for audio separation | |
Kitahara et al. | Instrogram: A new musical instrument recognition technique without using onset detection nor f0 estimation | |
Ngo et al. | Sound context classification based on joint learning model and multi-spectrogram features | |
WO2023002737A1 (en) | | A method and system for scene-aware audio-video representation
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||