CN114329036A - Cross-modal characteristic fusion system based on attention mechanism - Google Patents
Cross-modal characteristic fusion system based on attention mechanism Download PDFInfo
- Publication number
- CN114329036A (application CN202210256553.8A)
- Authority
- CN
- China
- Prior art keywords
- video
- audio
- segment
- rgb
- features
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Image Analysis (AREA)
Abstract
The invention provides a cross-modal feature fusion system based on an attention mechanism. Building on the complementary relationship between audio and video images, it proposes a method that uses supervised contrastive learning as a framework to extract features from the two modalities of audio and video RGB images, constructs an audio-video correlation analysis module to achieve audio-video alignment, and designs an attention-based cross-modal feature fusion module to fuse the audio and video features. Audio and RGB pictures are used as input to learn the video representation.
Description
Technical Field
The invention relates to the technical field of audio and video processing, in particular to a cross-modal feature fusion system based on an attention mechanism.
Background
For video representation learning, many supervised learning methods are receiving increasing attention; these include both conventional methods and deep learning methods. For example, the two-stream CNN processes the video frames and the dense optical flow separately and then directly fuses the class scores of the two networks to obtain a classification result. C3D processes the video with three-dimensional convolution kernels. The Temporal Segment Network (TSN) samples each video into several segments to model the long-range temporal structure of the video. The Temporal Relation Network (TRN) introduces an interpretable network to learn and infer temporal dependencies between video frames at multiple time scales. The Temporal Shift Module (TSM) shifts part of the channels along the time dimension to facilitate information exchange between adjacent frames. Although these supervised methods achieve good performance in modeling temporal dependencies, most of them only extract information from the RGB image modality of the video. With the development of the multi-modal field, researchers have begun to introduce multi-modal learning into video representation learning. Because of the dynamic nature and strict temporal ordering of video, learning its dynamic characteristics would undoubtedly improve the ability of a network to learn video features. Optical flow is the instantaneous velocity of the pixel motion of a moving object projected onto the observation imaging plane; it uses the temporal change of pixels in an image sequence and the correlation between adjacent frames to find the correspondence between the previous frame and the current frame, and thereby to calculate the motion information of objects between adjacent frames. Since optical flow captures the motion information of video well, most researchers use optical flow as a modality to improve the performance of video representation learning.
Although the RGB image contains the static information of the video and the optical flow contains its dynamic information, optical flow is itself generated from RGB images and is therefore not completely independent of the RGB image modality; moreover, existing 3D convolutional networks can already extract the dynamic information in the input image segment well. Thus, the use of the optical flow modality has reached a bottleneck. In video, besides rich picture information, there is also a lot of useful sound information. For example, the action of cutting down trees is usually accompanied by the sound of chopping, and the action of playing baseball by the sound of a bat hitting the ball; several studies have likewise demonstrated the effectiveness of audio. In previous related work, the network is trained by judging whether the audio and video are aligned and whether they belong to the same sample. Although such methods enable information interaction between modalities, they cannot solve the problem that intra-class sample differences are large while inter-class sample differences are small. Although these methods can learn good features, they share one disadvantage: the relevance of features between actions belonging to the same category is not taken into account.
The prior art discloses a bimodal emotion recognition method based on multi-modal deep learning, which obtains three-channel input matrices of audio and of video RGB images to form audio data samples and video data samples; constructs an audio deep convolutional neural network and a video deep convolutional neural network to obtain high-level audio features and high-level video features; establishes a fusion network composed of fully connected layers to construct a high-level unified audio-video feature; and aggregates the unified audio-video features output by the last fully connected layer of the fusion network into a global feature, which is input into a classifier to obtain the audio-video emotion recognition classification result. The fusion network composed of fully connected layers fuses the audio and video emotion information, constructs a high-level unified audio-video feature representation, and effectively improves audio-video emotion recognition performance. However, this prior art does not involve learning a video representation using audio and RGB pictures as input.
Disclosure of Invention
The invention provides a cross-modal feature fusion system based on an attention mechanism, which realizes the fusion of audio and video features and takes audio and RGB pictures as input to learn the video representation.
In order to achieve the technical effects, the technical scheme of the invention is as follows:
a cross-modal feature fusion system based on an attention mechanism, comprising:
the audio-video correlation analysis module is used for aligning the two modalities of audio and video RGB images;
the supervised contrastive learning module is used for extracting modality features from the two modalities of audio and video RGB images;
and the cross-modal feature fusion module is used for learning a global context representation by exploiting the correlated knowledge between the modalities.
Further, the audio-video correlation analysis module continuously acquires an RGB segment $v_i$ generated from 16 consecutive frames of RGB images of a video i as the input of the RGB image modality; at this time, only one segment is sampled from a video, and in order to make full use of the effective audio information in the video, the audio extracted from the entire video i is converted into the mel spectrogram $a_i$ of that video as the input of the audio modality, where i = 1, ···, N.
Further, the specific processing procedure of the supervised contrastive learning module is as follows:
1) modal feature extraction: the feature extracted from the RGB segment $v_i$ of video i by a 3D convolutional network with R3D as the backbone is denoted $f_i^v$; correspondingly, the feature extracted from the audio mel spectrogram $a_i$ by the audio 3D convolutional network is denoted $f_i^a$;
2) a self-supervised contrastive loss is generated through self-supervised contrastive learning;
3) a supervised contrastive loss is generated through supervised contrastive learning.
Further, the specific process of generating the self-supervised contrastive loss through self-supervised contrastive learning is as follows:
the positive pairs $\{f_i^v, f_i^a\}_{i=1,\cdots,N}$ are: the RGB segment feature $f_i^v$ from a video i and the mel-spectrogram feature $f_i^a$ generated from the corresponding audio; the negative pairs $\{f_i^v, f_j^v\}_{i\neq j} \cup \{f_i^v, f_j^a\}_{i\neq j}$ are: the RGB segment feature $f_i^v$ generated from video i paired with all RGB segment features $f_j^v$ and mel-spectrogram features $f_j^a$ generated from any other video j (i ≠ j); the self-supervised contrastive loss for the RGB image modality of the video is expressed as:

$$\mathcal{L}_{self}^{v}=-\sum_{i=1}^{N}\log\frac{\exp\left(f_i^v\cdot f_i^a/\tau\right)}{\sum_{j=1}^{N}\exp\left(f_i^v\cdot f_j^a/\tau\right)+\sum_{j\neq i}\exp\left(f_i^v\cdot f_j^v/\tau\right)}\qquad(1)$$

where $\tau$ is a scalar temperature parameter, the numerator measures the similarity of the positive pair, and the denominator sums the similarities over all positive and negative pairs;
similarly, the self-supervised contrastive loss for the audio modality is:

$$\mathcal{L}_{self}^{a}=-\sum_{i=1}^{N}\log\frac{\exp\left(f_i^a\cdot f_i^v/\tau\right)}{\sum_{j=1}^{N}\exp\left(f_i^a\cdot f_j^v/\tau\right)+\sum_{j\neq i}\exp\left(f_i^a\cdot f_j^a/\tau\right)}\qquad(2)$$

the overall self-supervised contrastive loss is given by equations (1) and (2):

$$\mathcal{L}_{self}=\mathcal{L}_{self}^{v}+\mathcal{L}_{self}^{a}\qquad(3)$$
further, the specific process of generating the supervised contrast loss by the supervised contrast learning is as follows:
directly facing,}i, j =1, ·, N, and yi = yj∪{,}i, j =1, ·, N with i ≠ j and yi = yjExpressed as:and all RGB segment features from video i and video j of the same classAnd Mel-frequency-map features generated by audio,(ii) a The rest are negative pairs; the supervised contrast loss formula is as follows:
the supervised contrast loss for the RGB image modality of video is expressed as:
wherein the content of the first and second substances,
is a scalar temperature parameter, where the numerator is the sum of all positive and negative sample distances and the denominator is the sum of all positive and negative sample distances;
similarly, the supervised contrast loss for audio modalities is:
the overall supervised contrast loss is given by equations (4) (5):
further, the cross-modal feature fusion module receives features from different modalities and learns global context embedding, which is then used to recalibrate input features from different segments, using video segment features learned from the supervised contrast learning framework as inputs, fused features as outputs, and computing the loss function of the fused portion by cross entropy.
Further, the specific processing procedure of the cross-modal feature fusion module is as follows:
the two modalities of a video i are $v_i$ and $a_i$, and the features extracted by the three-dimensional convolutional networks in the supervised contrastive learning framework are $f_i^v$ and $f_i^a$; to exploit the correlation between the two modalities, the two feature vectors are concatenated and a joint representation is obtained through a fully connected layer:

$$Z_u=W_s\left[f_i^v,f_i^a\right]+b_s\qquad(7)$$

where $[\cdot,\cdot]$ denotes the concatenation operation, $Z_u$ denotes the joint representation, and $W_s$ and $b_s$ are the weight and bias of the fully connected layer; the dimensionality of $Z_u$ is chosen to be smaller than that of the concatenated features in order to limit model capacity and increase generalization ability; in order to exploit the global context information aggregated in the joint representation $Z_u$, an excitation signal is predicted for each modality by a fully connected layer:

$$E_m=W_e^m Z_u+b_e^m,\quad m\in\{v,a\}\qquad(8)$$

where $W_e^m$ and $b_e^m$ are the weight and bias of the fully connected layer; after the excitation signal $E_m$ is obtained, it is used to adaptively recalibrate the input features through a simple gating mechanism:

$$\tilde{f}_i^v=\delta(E_v)\odot f_i^v,\qquad \tilde{f}_i^a=\delta(E_a)\odot f_i^a\qquad(9)$$

where $\odot$ is the channel-wise product of each element along the channel dimension and $\delta(\cdot)$ is the rectified linear function; in this way, the features of one segment are allowed to recalibrate the features of the other segment while preserving the correlation between the different segments;
the two refined feature vectors $\tilde{f}_i^v$ and $\tilde{f}_i^a$ are concatenated and input into a fully connected layer with the normalized exponential function soft-max as the classification output, and the cross-entropy loss is used to measure the correctness of the classification:

$$\mathcal{L}_{cross}=-\sum_{i=1}^{C}y_i\log p_i\qquad(10)$$

where $y_i$ and $p_i$ respectively denote the ground-truth and predicted probability that the sample belongs to class i, and C denotes the number of classes;
the overall loss function is obtained by combining equation (6) and equation (10), where $\lambda_{sup}$ and $\lambda_{cross}$ respectively control the contributions of $\mathcal{L}_{sup}$ and $\mathcal{L}_{cross}$:

$$\mathcal{L}=\lambda_{sup}\mathcal{L}_{sup}+\lambda_{cross}\mathcal{L}_{cross}\qquad(11)$$
further, each video segment is sized to have a size ofWhere c is the number of channels, l is the number of frames, and h and w represent the height and width of the frames.
Further, the size of the 3D convolution kernel is t × d × d, where t is the temporal length and d is the spatial size; the video RGB image sequence is defined as $V=\{v_i\}_{i=1}^{N}$, where $v_i$ is an RGB segment generated by consecutively sampling m frames from a video i (i = 1, ..., N).
Further, the audio modality is a mel spectrogram generated by applying a short-time Fourier transform to the entire audio of a video; one segment of the video RGB images and the mel spectrogram generated from the whole video are aligned as input; the audio mel-spectrogram sequence is denoted $A=\{a_i\}_{i=1}^{N}$, where $a_i$ is the mel spectrogram generated from the audio extracted from video i.
Compared with the prior art, the technical scheme of the invention has the beneficial effects that:
the invention provides a method for extracting characteristics of two modes of audio and video by using supervised contrast learning as a framework based on a complementary relation of information of the audio and video images, constructs an audio and video association analysis module to realize audio and video alignment, and designs a cross-mode characteristic fusion module based on an attention mechanism to realize the fusion of the audio and video characteristics. The audio and RGB pictures are used as input to achieve the goal of learning the video representation.
Drawings
FIG. 1 is a block diagram of the overall process of the system of the present invention;
FIG. 2 is an exemplary diagram of audio-video contrastive learning in the present invention;
FIG. 3 is a block diagram of the supervised contrastive learning (SCL) process of the present invention;
FIG. 4 is a framework diagram of the processing procedure of the cross-modal feature fusion module (MFAM) in the present invention.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent;
for the purpose of better illustrating the embodiments, certain features of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product;
it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
Example 1
As shown in fig. 1, a cross-modal feature fusion system based on attention mechanism includes:
the audio-video correlation analysis module is used for aligning the two modalities of audio and video RGB images;
the supervised contrastive learning module is used for extracting modality features from the two modalities of audio and video RGB images;
and the cross-modal feature fusion module is used for learning a global context representation by exploiting the correlated knowledge between the modalities.
The audio-video correlation analysis module continuously acquires an RGB segment $v_i$ generated from 16 consecutive frames of RGB images of a video i as the input of the RGB image modality; at this time, only one segment is sampled from a video, and in order to make full use of the effective audio information in the video, the audio extracted from the entire video i is converted into the mel spectrogram $a_i$ of that video as the input of the audio modality, where i = 1, ···, N.
The specific processing procedure of the supervised contrastive learning module is as follows:
1) modal feature extraction: the feature extracted from the RGB segment $v_i$ of video i by a 3D convolutional network with R3D as the backbone is denoted $f_i^v$; correspondingly, the feature extracted from the audio mel spectrogram $a_i$ by the audio 3D convolutional network is denoted $f_i^a$;
2) a self-supervised contrastive loss is generated through self-supervised contrastive learning;
3) a supervised contrastive loss is generated through supervised contrastive learning.
The two modalities of audio and video RGB images are aligned by the audio-video correlation analysis (AVCA) module. In this module, the video RGB image modality and the audio modality of each video are used as input. The video RGB image modality is a segment formed by randomly sampling 16 consecutive frames from the video. The audio modality is a mel spectrogram generated by applying a short-time Fourier transform to the entire audio of the video. One segment of the video RGB images is aligned with the mel spectrogram generated from the whole video as input.
In the supervised contrastive learning (SCL) module, spatio-temporal features are extracted from the video RGB image segments and the audio mel spectrograms using two different three-dimensional convolutional neural networks (3D CNNs), and the 3D CNNs within each modality share the same weights. A supervised contrastive loss is then designed on the features generated by the two modalities to enhance the discriminative power of representation learning for samples of the same class.
In multi-modal fusion, an attention-based cross-modal feature fusion module (MFAM) is introduced; the features learned from the supervised contrastive learning framework are propagated through the MFAM module and the channel features are adaptively recalibrated. The recalibrated features are connected and the loss function is calculated through cross entropy.
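For illustration only, the following minimal PyTorch-style sketch shows how the three modules described in this embodiment could be wired together in one training step; every name in it (rgb_encoder, audio_encoder, fusion_module, sup_contrast_loss, and the weighting defaults) is a hypothetical placeholder rather than the patented implementation, and concrete sketches of the individual pieces are given in the detailed description below.

```python
# Hypothetical end-to-end training step combining the three modules (assumed interfaces).
import torch

def training_step(rgb_clip, mel_spec, labels,
                  rgb_encoder, audio_encoder, fusion_module,
                  sup_contrast_loss, lambda_sup=1.0, lambda_cross=1.0):
    """rgb_clip: (N, 3, 16, H, W) RGB segments; mel_spec: (N, 1, ...) mel spectrograms."""
    f_v = rgb_encoder(rgb_clip)                       # RGB segment features
    f_a = audio_encoder(mel_spec)                     # audio mel-spectrogram features
    loss_sup = sup_contrast_loss(f_v, f_a, labels)    # supervised contrastive loss
    logits = fusion_module(f_v, f_a)                  # MFAM fusion + classification head
    loss_cross = torch.nn.functional.cross_entropy(logits, labels)
    return lambda_sup * loss_sup + lambda_cross * loss_cross
```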
Example 2
As shown in fig. 1, a cross-modal feature fusion system based on attention mechanism includes:
the audio-video correlation analysis module is used for aligning the two modalities of audio and video RGB images;
the supervised contrastive learning module is used for extracting modality features from the two modalities of audio and video RGB images;
and the cross-modal feature fusion module is used for learning a global context representation by exploiting the correlated knowledge between the modalities.
The audio-video correlation analysis module continuously acquires an RGB segment $v_i$ generated from 16 consecutive frames of RGB images of a video i as the input of the RGB image modality; at this time, only one segment is sampled from a video, and in order to make full use of the effective audio information in the video, the audio extracted from the entire video i is converted into the mel spectrogram $a_i$ of that video as the input of the audio modality, where i = 1, ···, N.
The specific processing procedure of the supervised contrastive learning module is as follows:
1) modal feature extraction: the feature extracted from the RGB segment $v_i$ of video i by a 3D convolutional network with R3D as the backbone is denoted $f_i^v$; correspondingly, the feature extracted from the audio mel spectrogram $a_i$ by the audio 3D convolutional network is denoted $f_i^a$;
2) a self-supervised contrastive loss is generated through self-supervised contrastive learning;
3) a supervised contrastive loss is generated through supervised contrastive learning.
The specific process of generating the self-supervised contrastive loss through self-supervised contrastive learning is as follows:
the positive pairs $\{f_i^v, f_i^a\}_{i=1,\cdots,N}$ are: the RGB segment feature $f_i^v$ from a video i and the mel-spectrogram feature $f_i^a$ generated from the corresponding audio; the negative pairs $\{f_i^v, f_j^v\}_{i\neq j} \cup \{f_i^v, f_j^a\}_{i\neq j}$ are: the RGB segment feature $f_i^v$ generated from video i paired with all RGB segment features $f_j^v$ and mel-spectrogram features $f_j^a$ generated from any other video j (i ≠ j); the self-supervised contrastive loss for the RGB image modality of the video is expressed as:

$$\mathcal{L}_{self}^{v}=-\sum_{i=1}^{N}\log\frac{\exp\left(f_i^v\cdot f_i^a/\tau\right)}{\sum_{j=1}^{N}\exp\left(f_i^v\cdot f_j^a/\tau\right)+\sum_{j\neq i}\exp\left(f_i^v\cdot f_j^v/\tau\right)}\qquad(1)$$

where $\tau$ is a scalar temperature parameter, the numerator measures the similarity of the positive pair, and the denominator sums the similarities over all positive and negative pairs;
similarly, the self-supervised contrastive loss for the audio modality is:

$$\mathcal{L}_{self}^{a}=-\sum_{i=1}^{N}\log\frac{\exp\left(f_i^a\cdot f_i^v/\tau\right)}{\sum_{j=1}^{N}\exp\left(f_i^a\cdot f_j^v/\tau\right)+\sum_{j\neq i}\exp\left(f_i^a\cdot f_j^a/\tau\right)}\qquad(2)$$

the overall self-supervised contrastive loss is given by equations (1) and (2):

$$\mathcal{L}_{self}=\mathcal{L}_{self}^{v}+\mathcal{L}_{self}^{a}\qquad(3)$$
the specific process of generating the supervised contrast loss by the supervised contrast learning is as follows:
directly facing,}i, j =1, ·, N, and yi = yj∪{,}i, j =1, ·, N with i ≠ j and yi = yjExpressed as:and all RGB segment features from video i and video j of the same classAnd Mel-frequency-map features generated by audio,(ii) a The rest are negative pairs; the supervised contrast loss formula is as follows:
the supervised contrast loss for the RGB image modality of video is expressed as:
wherein the content of the first and second substances,
is a scalar temperature parameter, where the numerator is the sum of all positive and negative sample distances and the denominator is the sum of all positive and negative sample distances;
similarly, the supervised contrast loss for audio modalities is:
the overall supervised contrast loss is given by equations (4) (5):
the cross-modal feature fusion module receives features from different modalities and learns global context embedding, then the embedding is used for recalibrating input features from different segments, video segment features learned from a supervised contrast learning framework are used as input, the fused features are used as output, and loss functions of a fusion part are calculated through cross entropy.
The specific processing procedure of the cross-modal feature fusion module is as follows:
the two modalities of a video i are $v_i$ and $a_i$, and the features extracted by the three-dimensional convolutional networks in the supervised contrastive learning framework are $f_i^v$ and $f_i^a$; to exploit the correlation between the two modalities, the two feature vectors are concatenated and a joint representation is obtained through a fully connected layer:

$$Z_u=W_s\left[f_i^v,f_i^a\right]+b_s\qquad(7)$$

where $[\cdot,\cdot]$ denotes the concatenation operation, $Z_u$ denotes the joint representation, and $W_s$ and $b_s$ are the weight and bias of the fully connected layer; the dimensionality of $Z_u$ is chosen to be smaller than that of the concatenated features in order to limit model capacity and increase generalization ability; in order to exploit the global context information aggregated in the joint representation $Z_u$, an excitation signal is predicted for each modality by a fully connected layer:

$$E_m=W_e^m Z_u+b_e^m,\quad m\in\{v,a\}\qquad(8)$$

where $W_e^m$ and $b_e^m$ are the weight and bias of the fully connected layer; after the excitation signal $E_m$ is obtained, it is used to adaptively recalibrate the input features through a simple gating mechanism:

$$\tilde{f}_i^v=\delta(E_v)\odot f_i^v,\qquad \tilde{f}_i^a=\delta(E_a)\odot f_i^a\qquad(9)$$

where $\odot$ is the channel-wise product of each element along the channel dimension and $\delta(\cdot)$ is the rectified linear function; in this way, the features of one segment are allowed to recalibrate the features of the other segment while preserving the correlation between the different segments;
the two refined feature vectors $\tilde{f}_i^v$ and $\tilde{f}_i^a$ are concatenated and input into a fully connected layer with the normalized exponential function soft-max as the classification output, and the cross-entropy loss is used to measure the correctness of the classification:

$$\mathcal{L}_{cross}=-\sum_{i=1}^{C}y_i\log p_i\qquad(10)$$

where $y_i$ and $p_i$ respectively denote the ground-truth and predicted probability that the sample belongs to class i, and C denotes the number of classes;
the overall loss function is obtained by combining equation (6) and equation (10), where $\lambda_{sup}$ and $\lambda_{cross}$ respectively control the contributions of $\mathcal{L}_{sup}$ and $\mathcal{L}_{cross}$:

$$\mathcal{L}=\lambda_{sup}\mathcal{L}_{sup}+\lambda_{cross}\mathcal{L}_{cross}\qquad(11)$$
example 3
To facilitate the description of each module, given N different videos, each video segment has size $c \times l \times h \times w$, where c is the number of channels, l is the number of frames, and h and w represent the height and width of the frames. The size of the 3D convolution kernel is t × d × d, where t is the temporal length and d is the spatial size. The video RGB image sequence is defined as $V=\{v_i\}_{i=1}^{N}$, where $v_i$ is an RGB segment generated by consecutively sampling m frames from a video i (i = 1, ..., N). The audio modality is a mel spectrogram generated by applying a short-time Fourier transform to the entire audio of a video; one segment of the video RGB images and the mel spectrogram generated from the whole video are aligned as input. The audio mel-spectrogram sequence is denoted $A=\{a_i\}_{i=1}^{N}$, where $a_i$ is the mel spectrogram generated from the audio extracted from video i. $y_i$ is the category label of video i.
1) Audio-video correlation analysis (audio-video alignment)
The sound signal is one-dimensional: only its time-domain information can be observed directly, not its frequency-domain information. The signal can be transformed to the frequency domain by the Fourier transform (FT), but the time-domain information is then lost and the time-frequency relationship cannot be seen. Many methods have been developed to solve this problem; the short-time Fourier transform and wavelets are common time-frequency analysis methods.
The short-time Fourier transform (STFT) is a Fourier transform applied to short-time signals. The principle is: a long speech signal is framed and windowed, a Fourier transform is applied to each frame, and the results of all frames are stacked along another dimension to obtain a two-dimensional image, the spectrogram.
Since the resulting spectrogram is large, it is usually passed through mel-scale filter banks to obtain a mel spectrogram of a suitable size as the sound feature.
In conventional audio-video alignment, an RGB image is mostly aligned with the mel spectrum generated from the audio of the corresponding time span. This method can align the two modalities and extract the static image information and audio information of the video, but it ignores the temporal information contained in the video itself.
In order to utilize the temporal information of the video, the invention continuously acquires an RGB segment $v_i$ generated from 16 consecutive frames of RGB images of a video i (i = 1, ..., N) as the input of the RGB image modality. At this time, only one segment is sampled from a video; in order to make full use of the effective audio information in the video, the audio extracted from the entire video i is converted into the mel spectrogram $a_i$ of that video as the input of the audio modality.
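As a minimal sketch of this preprocessing step (an assumption made for illustration, not the patented procedure), the full-video audio could be converted to a log-mel spectrogram with librosa; the sampling rate, FFT size, hop length, and number of mel bands below are illustrative choices that the invention does not specify.

```python
# Hypothetical preprocessing: whole-video audio -> mel spectrogram a_i (assumed parameters).
import librosa
import numpy as np

def video_audio_to_mel(audio_path, sr=16000, n_fft=1024, hop_length=512, n_mels=64):
    y, _ = librosa.load(audio_path, sr=sr)                          # waveform of the whole video
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel, ref=np.max)                  # log-mel spectrogram
    return log_mel[np.newaxis, ...]                                 # (1, n_mels, T) single-channel "image"
```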
1.1) Audio-video contrastive learning
In the image field, self-supervised contrastive learning is a common learning method. Its core idea is that features of data from the same sample should be as close as possible, while features of data from different samples should be as far apart as possible. In the image field, data are generally augmented by flipping or cropping a picture, so that the generated picture and the original picture form a positive pair, while all other pictures form negative pairs with the original picture. The contrastive loss pulls positive pairs close and pushes negative pairs apart.
In order to make the features of similar actions close to each other, researchers have proposed a new contrastive learning method: supervised contrastive learning. Its core idea is that features of data from the same category should be as close as possible, while features of data from different categories should be as far apart as possible. The positive pairs are then extended to the pictures generated by augmenting the original picture and the pictures having the same category label as the original picture, and the negative pairs are all pictures that do not belong to the same category as the original picture.
Although contrastive learning has been widely applied to image learning, and some researchers have introduced it into video representation learning, combining contrastive learning with the multi-modal field has only been proposed in recent years. In the multi-modal field, most studies only use RGB images and optical flow as the two modalities, and audio is used as a modality far less often. Therefore, the invention introduces supervised contrastive learning into audio-video multi-modal learning, so that the model can better extract the features of the different modalities and better distinguish samples with large intra-class differences and small inter-class differences.
2) Modal feature extraction
The feature extracted from the RGB segment $v_i$ of video i by a 3D convolutional network with R3D as the backbone is denoted $f_i^v$; correspondingly, the feature extracted from the audio mel spectrogram $a_i$ by the audio 3D convolutional network is denoted $f_i^a$. The difference between the two networks is the number of channels of the input image.
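As an assumed sketch of how such encoders might be built (not the patented implementation), both feature extractors could reuse torchvision's r3d_18 backbone, with the first convolution adapted to a single-channel mel-spectrogram input; the 512-dimensional feature size is the backbone's default and the channel adaptation is an illustrative choice.

```python
# Hypothetical R3D-based encoders for the RGB and audio modalities (assumed design).
import torch.nn as nn
from torchvision.models.video import r3d_18

def make_encoder(in_channels=3):
    net = r3d_18(weights=None)
    if in_channels != 3:  # e.g. a 1-channel mel spectrogram treated as a short single-channel clip
        net.stem[0] = nn.Conv3d(in_channels, 64, kernel_size=(3, 7, 7),
                                stride=(1, 2, 2), padding=(1, 3, 3), bias=False)
    net.fc = nn.Identity()      # expose the 512-d feature f_i instead of class logits
    return net

rgb_encoder = make_encoder(3)   # input: (N, 3, 16, H, W) RGB segments v_i
audio_encoder = make_encoder(1) # input: (N, 1, T', H', W') reshaped mel spectrogram a_i
```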
3) Supervised contrastive learning
3.1) Self-supervised contrastive learning
As shown in fig. 3, the supervised contrastive learning (SCL) framework first builds on self-supervised contrastive learning, whose core is to make the distance between data features from the same sample as small as possible and the distance between data features from different samples as large as possible.
In the invention, taking the RGB segment feature $f_i^v$ of a video i as an example, the positive pairs $\{f_i^v, f_i^a\}_{i=1,\cdots,N}$ are: the RGB segment feature $f_i^v$ from video i and the mel-spectrogram feature $f_i^a$ generated from the corresponding audio; the negative pairs $\{f_i^v, f_j^v\}_{i\neq j} \cup \{f_i^v, f_j^a\}_{i\neq j}$ are: the RGB segment feature $f_i^v$ generated from video i paired with all RGB segment features $f_j^v$ and mel-spectrogram features $f_j^a$ generated from any other video j (i ≠ j), as shown in fig. 2. At this time, the self-supervised contrastive loss for the RGB image modality of the video is expressed as:

$$\mathcal{L}_{self}^{v}=-\sum_{i=1}^{N}\log\frac{\exp\left(f_i^v\cdot f_i^a/\tau\right)}{\sum_{j=1}^{N}\exp\left(f_i^v\cdot f_j^a/\tau\right)+\sum_{j\neq i}\exp\left(f_i^v\cdot f_j^v/\tau\right)}\qquad(1)$$

where $\tau$ is a scalar temperature parameter, the numerator measures the similarity of the positive pair, and the denominator sums the similarities over all positive and negative pairs;
similarly, the self-supervised contrastive loss for the audio modality is:

$$\mathcal{L}_{self}^{a}=-\sum_{i=1}^{N}\log\frac{\exp\left(f_i^a\cdot f_i^v/\tau\right)}{\sum_{j=1}^{N}\exp\left(f_i^a\cdot f_j^v/\tau\right)+\sum_{j\neq i}\exp\left(f_i^a\cdot f_j^a/\tau\right)}\qquad(2)$$

the overall self-supervised contrastive loss is given by equations (1) and (2):

$$\mathcal{L}_{self}=\mathcal{L}_{self}^{v}+\mathcal{L}_{self}^{a}\qquad(3)$$
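Purely as an illustration of a loss of this form (mirroring equations (1)-(3) as reconstructed above, which are themselves a reconstruction rather than the original drawings), a cross-modal InfoNCE-style loss could be sketched in PyTorch as follows; the L2 normalization of features and the default temperature of 0.07 are assumptions.

```python
# Hypothetical self-supervised cross-modal contrastive loss (InfoNCE-style sketch).
import torch
import torch.nn.functional as F

def self_supervised_contrast_loss(f_v, f_a, tau=0.07):
    """f_v, f_a: (N, D) RGB-segment and mel-spectrogram features of the same N videos."""
    f_v, f_a = F.normalize(f_v, dim=1), F.normalize(f_a, dim=1)
    sim_va = f_v @ f_a.t() / tau          # anchor f_i^v vs all audio features
    sim_vv = f_v @ f_v.t() / tau          # anchor f_i^v vs other RGB features
    sim_aa = f_a @ f_a.t() / tau
    eye = torch.eye(f_v.size(0), dtype=torch.bool, device=f_v.device)

    def one_direction(sim_cross, sim_same):
        pos = sim_cross.diag()                               # positive pair: same video, other modality
        same = sim_same.masked_fill(eye, float('-inf'))      # exclude the anchor itself
        denom = torch.cat([sim_cross, same], dim=1)          # positive + all negative pairs
        return -(pos - torch.logsumexp(denom, dim=1)).mean()

    return one_direction(sim_va, sim_vv) + one_direction(sim_va.t(), sim_aa)
```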
3.2) Supervised contrastive learning
Although self-supervised contrastive learning can learn good features, it has a disadvantage: the relevance of features between actions belonging to the same category is not considered. In order to make the features of actions of the same class close to each other, a new contrastive learning method is adopted: supervised contrastive learning. Its core is to make the distance between features of data from samples of the same class as small as possible, and the distance between features of data from different classes as large as possible.
In the present invention, taking the RGB segment feature $f_i^v$ as an example, the positive pairs $\{f_i^v, f_j^a\}_{y_i=y_j} \cup \{f_i^v, f_j^v\}_{i\neq j,\,y_i=y_j}$ are: $f_i^v$ paired with all RGB segment features $f_j^v$ and mel-spectrogram features $f_j^a$ from videos i and j of the same class; the remaining pairs are negative pairs. The supervised contrastive loss for the RGB image modality of the video is expressed as:

$$\mathcal{L}_{sup}^{v}=-\sum_{i=1}^{N}\frac{1}{|P(i)|}\sum_{p\in P(i)}\log\frac{\exp\left(f_i^v\cdot p/\tau\right)}{\sum_{q\in A(i)}\exp\left(f_i^v\cdot q/\tau\right)}\qquad(4)$$

where $P(i)=\{f_j^v: i\neq j,\ y_i=y_j\}\cup\{f_j^a: y_i=y_j\}$ is the set of positive features for the anchor $f_i^v$, $A(i)$ is the set of all features other than the anchor, $\tau$ is a scalar temperature parameter, the numerator measures the similarity of a positive pair, and the denominator sums the similarities over all positive and negative pairs;
similarly, the supervised contrastive loss for the audio modality is:

$$\mathcal{L}_{sup}^{a}=-\sum_{i=1}^{N}\frac{1}{|P(i)|}\sum_{p\in P(i)}\log\frac{\exp\left(f_i^a\cdot p/\tau\right)}{\sum_{q\in A(i)}\exp\left(f_i^a\cdot q/\tau\right)}\qquad(5)$$

the overall supervised contrastive loss is given by equations (4) and (5):

$$\mathcal{L}_{sup}=\mathcal{L}_{sup}^{v}+\mathcal{L}_{sup}^{a}\qquad(6)$$
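Similarly, a sketch of a cross-modal supervised contrastive loss in the style of equations (4)-(6) as reconstructed above could look as follows; pooling both modalities into one bank of 2N features and the per-anchor averaging over positives are assumptions rather than details confirmed by the original drawings.

```python
# Hypothetical cross-modal supervised contrastive loss (SupCon-style sketch).
import torch
import torch.nn.functional as F

def supervised_contrast_loss(f_v, f_a, labels, tau=0.07):
    """f_v, f_a: (N, D) features; labels: (N,) class labels y_i."""
    feats = F.normalize(torch.cat([f_v, f_a], dim=0), dim=1)       # (2N, D): both modalities pooled
    lab = torch.cat([labels, labels], dim=0)
    sim = feats @ feats.t() / tau
    self_mask = torch.eye(feats.size(0), dtype=torch.bool, device=feats.device)
    pos_mask = (lab.unsqueeze(0) == lab.unsqueeze(1)) & ~self_mask # same-class pairs, anchor excluded

    sim = sim.masked_fill(self_mask, float('-inf'))                # drop anchor-with-itself terms
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)     # log-softmax over all other features
    mean_log_prob_pos = (log_prob * pos_mask).sum(1) / pos_mask.sum(1).clamp(min=1)
    return -mean_log_prob_pos.mean()
```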
4) multimodal fusion
As shown in fig. 4, in order to better fuse information between the different modalities, an attention-based cross-modal feature fusion module (MFAM) is proposed. Since features from different modalities are correlated, a cross-modal feature fusion module is constructed that receives features from the different modalities and learns a global context embedding, which is then used to recalibrate the input features from the different segments, as shown in fig. 4. The video segment features learned from the supervised contrastive learning framework are used as input, the fused features are used as output, and the loss function of the fusion part is calculated through cross entropy.
To fix notation, assume that the two modalities of a video i are $v_i$ and $a_i$, and that the features extracted by the three-dimensional convolutional networks in the supervised contrastive learning framework are $f_i^v$ and $f_i^a$. To exploit the correlation between the two modalities, the two feature vectors are concatenated and a joint representation is obtained through a fully connected layer:

$$Z_u=W_s\left[f_i^v,f_i^a\right]+b_s\qquad(7)$$

where $[\cdot,\cdot]$ denotes the concatenation operation, $Z_u$ denotes the joint representation, and $W_s$ and $b_s$ are the weight and bias of the fully connected layer. The dimensionality of $Z_u$ is chosen to be smaller than that of the concatenated features in order to limit model capacity and increase generalization ability. In order to exploit the global context information aggregated in the joint representation $Z_u$, an excitation signal is predicted for each modality by a fully connected layer:

$$E_m=W_e^m Z_u+b_e^m,\quad m\in\{v,a\}\qquad(8)$$

where $W_e^m$ and $b_e^m$ are the weight and bias of the fully connected layer. After the excitation signal $E_m$ is obtained, it is used to adaptively recalibrate the input features through a simple gating mechanism:

$$\tilde{f}_i^v=\delta(E_v)\odot f_i^v,\qquad \tilde{f}_i^a=\delta(E_a)\odot f_i^a\qquad(9)$$

where $\odot$ is the channel-wise product of each element along the channel dimension and $\delta(\cdot)$ is the rectified linear function; in this way, the features of one segment are allowed to recalibrate the features of the other segment while preserving the correlation between the different segments.
The two refined feature vectors $\tilde{f}_i^v$ and $\tilde{f}_i^a$ are concatenated and input into a fully connected layer with the normalized exponential function soft-max as the classification output, and the cross-entropy loss is used to measure the correctness of the classification:

$$\mathcal{L}_{cross}=-\sum_{i=1}^{C}y_i\log p_i\qquad(10)$$

where $y_i$ and $p_i$ respectively denote the ground-truth and predicted probability that the sample belongs to class i, and C denotes the number of classes.
The overall loss function is obtained by combining equation (6) and equation (10), where $\lambda_{sup}$ and $\lambda_{cross}$ respectively control the contributions of $\mathcal{L}_{sup}$ and $\mathcal{L}_{cross}$:

$$\mathcal{L}=\lambda_{sup}\mathcal{L}_{sup}+\lambda_{cross}\mathcal{L}_{cross}\qquad(11)$$
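A compact PyTorch sketch of a fusion module of this kind is given below; it follows the reconstructed equations (7)-(10), but the feature dimensions, the reduction ratio of the joint representation, the per-modality excitation heads, the ReLU gating, and the number of classes are all assumptions made for illustration, not the patented design.

```python
# Hypothetical attention-based cross-modal feature fusion module (MFAM-style sketch).
import torch
import torch.nn as nn

class MFAM(nn.Module):
    def __init__(self, dim_v=512, dim_a=512, reduction=4, num_classes=101):
        super().__init__()
        joint_dim = (dim_v + dim_a) // reduction               # reduced joint representation Z_u
        self.squeeze = nn.Linear(dim_v + dim_a, joint_dim)     # eq. (7): joint representation
        self.excite_v = nn.Linear(joint_dim, dim_v)            # eq. (8): one excitation head per modality
        self.excite_a = nn.Linear(joint_dim, dim_a)
        self.act = nn.ReLU(inplace=True)
        self.classifier = nn.Linear(dim_v + dim_a, num_classes)

    def forward(self, f_v, f_a):
        z_u = self.squeeze(torch.cat([f_v, f_a], dim=1))
        f_v_hat = self.act(self.excite_v(z_u)) * f_v           # eq. (9): gated recalibration
        f_a_hat = self.act(self.excite_a(z_u)) * f_a
        return self.classifier(torch.cat([f_v_hat, f_a_hat], dim=1))  # logits for soft-max / eq. (10)
```

Cross entropy over these logits gives $\mathcal{L}_{cross}$, which would then be weighted against the supervised contrastive loss as in equation (11).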
the same or similar reference numerals correspond to the same or similar parts;
the positional relationships depicted in the drawings are for illustrative purposes only and are not to be construed as limiting the present patent;
it should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the present invention, and are not intended to limit the embodiments of the present invention. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. And are neither required nor exhaustive of all embodiments. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the claims of the present invention.
Claims (10)
1. A cross-modal feature fusion system based on an attention mechanism, comprising:
the audio-video correlation analysis module is used for aligning the two modalities of audio and video RGB images;
the supervised contrastive learning module is used for extracting modality features from the two modalities of audio and video RGB images;
and the cross-modal feature fusion module is used for learning a global context representation by exploiting the correlated knowledge between the modalities.
2. The attention-based cross-modal feature fusion system of claim 1, wherein the audio-video correlation analysis module continuously acquires an RGB segment $v_i$ generated from 16 consecutive frames of RGB images of a video i as the input of the RGB image modality; at this time, only one segment is sampled from a video, and in order to make full use of the effective audio information in the video, the audio extracted from the entire video i is converted into the mel spectrogram $a_i$ of that video as the input of the audio modality; wherein i = 1, ···, N.
3. The attention-based cross-modal feature fusion system of claim 2, wherein the specific processing procedure of the supervised contrastive learning module is as follows:
1) modal feature extraction: the feature extracted from the RGB segment $v_i$ of video i by a 3D convolutional network with R3D as the backbone is denoted $f_i^v$; correspondingly, the feature extracted from the audio mel spectrogram $a_i$ by the audio 3D convolutional network is denoted $f_i^a$;
2) a self-supervised contrastive loss is generated through self-supervised contrastive learning;
3) a supervised contrastive loss is generated through supervised contrastive learning.
4. The attention-based cross-modal feature fusion system of claim 3, wherein the specific process of generating the self-supervised contrastive loss through self-supervised contrastive learning is as follows:
the positive pairs $\{f_i^v, f_i^a\}_{i=1,\cdots,N}$ are: the RGB segment feature $f_i^v$ from a video i and the mel-spectrogram feature $f_i^a$ generated from the corresponding audio; the negative pairs $\{f_i^v, f_j^v\}_{i\neq j} \cup \{f_i^v, f_j^a\}_{i\neq j}$ are: the RGB segment feature $f_i^v$ generated from video i paired with all RGB segment features $f_j^v$ and mel-spectrogram features $f_j^a$ generated from any other video j (i ≠ j); the self-supervised contrastive loss for the RGB image modality of the video is expressed as:

$$\mathcal{L}_{self}^{v}=-\sum_{i=1}^{N}\log\frac{\exp\left(f_i^v\cdot f_i^a/\tau\right)}{\sum_{j=1}^{N}\exp\left(f_i^v\cdot f_j^a/\tau\right)+\sum_{j\neq i}\exp\left(f_i^v\cdot f_j^v/\tau\right)}\qquad(1)$$

where $\tau$ is a scalar temperature parameter, the numerator measures the similarity of the positive pair, and the denominator sums the similarities over all positive and negative pairs;
similarly, the self-supervised contrastive loss for the audio modality is:

$$\mathcal{L}_{self}^{a}=-\sum_{i=1}^{N}\log\frac{\exp\left(f_i^a\cdot f_i^v/\tau\right)}{\sum_{j=1}^{N}\exp\left(f_i^a\cdot f_j^v/\tau\right)+\sum_{j\neq i}\exp\left(f_i^a\cdot f_j^a/\tau\right)}\qquad(2)$$

the overall self-supervised contrastive loss is given by equations (1) and (2):

$$\mathcal{L}_{self}=\mathcal{L}_{self}^{v}+\mathcal{L}_{self}^{a}\qquad(3)$$
5. The attention-based cross-modal feature fusion system of claim 4, wherein the specific process of generating the supervised contrastive loss through supervised contrastive learning is as follows:
the positive pairs $\{f_i^v, f_j^a\}_{y_i=y_j} \cup \{f_i^v, f_j^v\}_{i\neq j,\,y_i=y_j}$ are: $f_i^v$ paired with all RGB segment features $f_j^v$ and mel-spectrogram features $f_j^a$ from videos i and j of the same class; the remaining pairs are negative pairs; the supervised contrastive loss for the RGB image modality of the video is expressed as:

$$\mathcal{L}_{sup}^{v}=-\sum_{i=1}^{N}\frac{1}{|P(i)|}\sum_{p\in P(i)}\log\frac{\exp\left(f_i^v\cdot p/\tau\right)}{\sum_{q\in A(i)}\exp\left(f_i^v\cdot q/\tau\right)}\qquad(4)$$

where $P(i)=\{f_j^v: i\neq j,\ y_i=y_j\}\cup\{f_j^a: y_i=y_j\}$ is the set of positive features for the anchor $f_i^v$, $A(i)$ is the set of all features other than the anchor, $\tau$ is a scalar temperature parameter, the numerator measures the similarity of a positive pair, and the denominator sums the similarities over all positive and negative pairs;
similarly, the supervised contrastive loss for the audio modality is:

$$\mathcal{L}_{sup}^{a}=-\sum_{i=1}^{N}\frac{1}{|P(i)|}\sum_{p\in P(i)}\log\frac{\exp\left(f_i^a\cdot p/\tau\right)}{\sum_{q\in A(i)}\exp\left(f_i^a\cdot q/\tau\right)}\qquad(5)$$

the overall supervised contrastive loss is given by equations (4) and (5):

$$\mathcal{L}_{sup}=\mathcal{L}_{sup}^{v}+\mathcal{L}_{sup}^{a}\qquad(6)$$
6. The attention-based cross-modal feature fusion system of claim 5, wherein the cross-modal feature fusion module receives features from the different modalities and learns a global context embedding, which is then used to recalibrate the input features from the different segments; the video segment features learned from the supervised contrastive learning framework are used as input, the fused features are used as output, and the loss function of the fusion part is calculated through cross entropy.
7. The attention-based cross-modal feature fusion system of claim 6, wherein the specific processing procedure of the cross-modal feature fusion module is as follows:
the two modalities of a video i are $v_i$ and $a_i$, and the features extracted by the three-dimensional convolutional networks in the supervised contrastive learning framework are $f_i^v$ and $f_i^a$; to exploit the correlation between the two modalities, the two feature vectors are concatenated and a joint representation is obtained through a fully connected layer:

$$Z_u=W_s\left[f_i^v,f_i^a\right]+b_s\qquad(7)$$

where $[\cdot,\cdot]$ denotes the concatenation operation, $Z_u$ denotes the joint representation, and $W_s$ and $b_s$ are the weight and bias of the fully connected layer; the dimensionality of $Z_u$ is chosen to be smaller than that of the concatenated features in order to limit model capacity and increase generalization ability; in order to exploit the global context information aggregated in the joint representation $Z_u$, an excitation signal is predicted for each modality by a fully connected layer:

$$E_m=W_e^m Z_u+b_e^m,\quad m\in\{v,a\}\qquad(8)$$

where $W_e^m$ and $b_e^m$ are the weight and bias of the fully connected layer; after the excitation signal $E_m$ is obtained, it is used to adaptively recalibrate the input features through a simple gating mechanism:

$$\tilde{f}_i^v=\delta(E_v)\odot f_i^v,\qquad \tilde{f}_i^a=\delta(E_a)\odot f_i^a\qquad(9)$$

where $\odot$ is the channel-wise product of each element along the channel dimension and $\delta(\cdot)$ is the rectified linear function; in this way, the features of one segment are allowed to recalibrate the features of the other segment while preserving the correlation between the different segments;
the two refined feature vectors $\tilde{f}_i^v$ and $\tilde{f}_i^a$ are concatenated and input into a fully connected layer with the normalized exponential function soft-max as the classification output, and the cross-entropy loss is used to measure the correctness of the classification:

$$\mathcal{L}_{cross}=-\sum_{i=1}^{C}y_i\log p_i\qquad(10)$$

where $y_i$ and $p_i$ respectively denote the ground-truth and predicted probability that the sample belongs to class i, and C denotes the number of classes;
the overall loss function is obtained by combining equation (6) and equation (10), where $\lambda_{sup}$ and $\lambda_{cross}$ respectively control the contributions of $\mathcal{L}_{sup}$ and $\mathcal{L}_{cross}$:

$$\mathcal{L}=\lambda_{sup}\mathcal{L}_{sup}+\lambda_{cross}\mathcal{L}_{cross}\qquad(11)$$
8. The attention-based cross-modal feature fusion system of claim 7, wherein each video segment has size $c \times l \times h \times w$, where c is the number of channels, l is the number of frames, and h and w represent the height and width of the frames.
9. The attention-based cross-modal feature fusion system of claim 8, wherein the size of the 3D convolution kernel is t × d × d, where t is the temporal length and d is the spatial size; the video RGB image sequence is defined as $V=\{v_i\}_{i=1}^{N}$, where $v_i$ is an RGB segment generated by consecutively sampling m frames from a video i (i = 1, ..., N).
10. The attention-based cross-modal feature fusion system of claim 9, wherein the audio modality is a mel spectrogram generated by applying a short-time Fourier transform to the entire audio of a video; one segment of the video RGB images and the mel spectrogram generated from the whole video are aligned as input; the audio mel-spectrogram sequence is denoted $A=\{a_i\}_{i=1}^{N}$, where $a_i$ is the mel spectrogram generated from the audio extracted from video i.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210256553.8A CN114329036B (en) | 2022-03-16 | 2022-03-16 | Cross-modal characteristic fusion system based on attention mechanism |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210256553.8A CN114329036B (en) | 2022-03-16 | 2022-03-16 | Cross-modal characteristic fusion system based on attention mechanism |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114329036A true CN114329036A (en) | 2022-04-12 |
CN114329036B CN114329036B (en) | 2022-07-05 |
Family
ID=81033312
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210256553.8A Active CN114329036B (en) | 2022-03-16 | 2022-03-16 | Cross-modal characteristic fusion system based on attention mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114329036B (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115019397A (en) * | 2022-06-15 | 2022-09-06 | 北京大学深圳研究生院 | Comparison self-monitoring human behavior recognition method and system based on temporal-spatial information aggregation |
CN115100390A (en) * | 2022-08-24 | 2022-09-23 | 华东交通大学 | Image emotion prediction method combining contrast learning and self-supervision region positioning |
CN115116448A (en) * | 2022-08-29 | 2022-09-27 | 四川启睿克科技有限公司 | Voice extraction method, neural network model training method, device and storage medium |
CN115620110A (en) * | 2022-12-16 | 2023-01-17 | 华南理工大学 | Video event positioning and identifying method, device and storage medium |
CN116824495A (en) * | 2023-06-26 | 2023-09-29 | 华东交通大学 | Dangerous behavior identification method, system, storage medium and computer equipment |
CN117173394A (en) * | 2023-08-07 | 2023-12-05 | 山东大学 | Weak supervision salient object detection method and system for unmanned aerial vehicle video data |
WO2024087337A1 (en) * | 2022-10-24 | 2024-05-02 | 深圳先进技术研究院 | Method for directly synthesizing speech from tongue ultrasonic images |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112820320A (en) * | 2020-12-31 | 2021-05-18 | 中国科学技术大学 | Cross-modal attention consistency network self-supervision learning method |
US20210342646A1 (en) * | 2020-04-30 | 2021-11-04 | Arizona Board Of Regents On Behalf Of Arizona State University | Systems, methods, and apparatuses for training a deep model to learn contrastive representations embedded within part-whole semantics via a self-supervised learning framework |
CN114118200A (en) * | 2021-09-24 | 2022-03-01 | 杭州电子科技大学 | Multi-modal emotion classification method based on attention-guided bidirectional capsule network |
- 2022-03-16: CN application CN202210256553.8A granted as patent CN114329036B (active)
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20210342646A1 (en) * | 2020-04-30 | 2021-11-04 | Arizona Board Of Regents On Behalf Of Arizona State University | Systems, methods, and apparatuses for training a deep model to learn contrastive representations embedded within part-whole semantics via a self-supervised learning framework |
CN112820320A (en) * | 2020-12-31 | 2021-05-18 | 中国科学技术大学 | Cross-modal attention consistency network self-supervision learning method |
CN114118200A (en) * | 2021-09-24 | 2022-03-01 | 杭州电子科技大学 | Multi-modal emotion classification method based on attention-guided bidirectional capsule network |
Non-Patent Citations (1)
Title |
---|
Tan Huadong: "Research on Cross-modal Generation and Synchronization Discrimination for Audio-Visual Data", China Master's Theses Full-text Database (Information Science and Technology) *
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115019397A (en) * | 2022-06-15 | 2022-09-06 | 北京大学深圳研究生院 | Comparison self-monitoring human behavior recognition method and system based on temporal-spatial information aggregation |
CN115019397B (en) * | 2022-06-15 | 2024-04-19 | 北京大学深圳研究生院 | Method and system for identifying contrasting self-supervision human body behaviors based on time-space information aggregation |
CN115100390A (en) * | 2022-08-24 | 2022-09-23 | 华东交通大学 | Image emotion prediction method combining contrast learning and self-supervision region positioning |
CN115116448A (en) * | 2022-08-29 | 2022-09-27 | 四川启睿克科技有限公司 | Voice extraction method, neural network model training method, device and storage medium |
CN115116448B (en) * | 2022-08-29 | 2022-11-15 | 四川启睿克科技有限公司 | Voice extraction method, neural network model training method, device and storage medium |
WO2024087337A1 (en) * | 2022-10-24 | 2024-05-02 | 深圳先进技术研究院 | Method for directly synthesizing speech from tongue ultrasonic images |
CN115620110A (en) * | 2022-12-16 | 2023-01-17 | 华南理工大学 | Video event positioning and identifying method, device and storage medium |
CN115620110B (en) * | 2022-12-16 | 2023-03-21 | 华南理工大学 | Video event positioning and identifying method, device and storage medium |
CN116824495A (en) * | 2023-06-26 | 2023-09-29 | 华东交通大学 | Dangerous behavior identification method, system, storage medium and computer equipment |
CN117173394A (en) * | 2023-08-07 | 2023-12-05 | 山东大学 | Weak supervision salient object detection method and system for unmanned aerial vehicle video data |
CN117173394B (en) * | 2023-08-07 | 2024-04-02 | 山东大学 | Weak supervision salient object detection method and system for unmanned aerial vehicle video data |
Also Published As
Publication number | Publication date |
---|---|
CN114329036B (en) | 2022-07-05 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN114329036B (en) | Cross-modal characteristic fusion system based on attention mechanism | |
Liu et al. | Deep learning for generic object detection: A survey | |
CN108804453B (en) | Video and audio recognition method and device | |
Lee et al. | Multi-view automatic lip-reading using neural network | |
US20200004493A1 (en) | Electronic apparatus, document displaying method thereof and non-transitory computer readable recording medium | |
CN112818861A (en) | Emotion classification method and system based on multi-mode context semantic features | |
WO2020177673A1 (en) | Video sequence selection method, computer device and storage medium | |
JP2023546173A (en) | Facial recognition type person re-identification system | |
Zong et al. | Emotion recognition in the wild via sparse transductive transfer linear discriminant analysis | |
US20220415023A1 (en) | Model update method and related apparatus | |
CN113822125B (en) | Processing method and device of lip language recognition model, computer equipment and storage medium | |
CN114519809A (en) | Audio-visual video analysis device and method based on multi-scale semantic network | |
CN110991500A (en) | Small sample multi-classification method based on nested integrated depth support vector machine | |
Agbo-Ajala et al. | A lightweight convolutional neural network for real and apparent age estimation in unconstrained face images | |
CN115147641A (en) | Video classification method based on knowledge distillation and multi-mode fusion | |
Islam et al. | Representation for action recognition with motion vector termed as: SDQIO | |
Aliakbarian et al. | Deep action-and context-aware sequence learning for activity recognition and anticipation | |
Chen et al. | Dual-bottleneck feature pyramid network for multiscale object detection | |
Rastgoo et al. | Word separation in continuous sign language using isolated signs and post-processing | |
US20220086401A1 (en) | System and method for language-guided video analytics at the edge | |
Liu et al. | A multimodal approach for multiple-relation extraction in videos | |
Afrasiabi et al. | Spatial-temporal dual-actor CNN for human interaction prediction in video | |
de Souza et al. | Building semantic understanding beyond deep learning from sound and vision | |
CN116958852A (en) | Video and text matching method and device, electronic equipment and storage medium | |
CN115222047A (en) | Model training method, device, equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |