CN114329036A - Cross-modal characteristic fusion system based on attention mechanism - Google Patents
Cross-modal characteristic fusion system based on attention mechanism Download PDFInfo
- Publication number
- CN114329036A (application CN202210256553.8A)
- Authority
- CN
- China
- Prior art keywords
- video
- audio
- segment
- rgb
- features
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Image Analysis (AREA)
Abstract
The invention provides a cross-modal feature fusion system based on an attention mechanism. Building on the complementary relationship between audio and video images, it proposes a method that uses supervised contrastive learning as a framework to extract features from the two modalities of audio and video RGB images, constructs an audio-video correlation analysis module to achieve audio-video alignment, and designs an attention-based cross-modal feature fusion module to fuse the audio and video features. Audio and RGB pictures are used as input to learn the video representation.
Description
Technical Field
The invention relates to the technical field of audio and video processing, in particular to a cross-modal feature fusion system based on an attention mechanism.
Background
For video representation learning, many supervised learning methods are receiving increasing attention; these include both conventional methods and deep learning methods. For example, the two-stream CNN processes the video frames and the dense optical flow separately and then directly fuses the class scores of the two networks to obtain a classification result. C3D processes the video with three-dimensional convolution kernels. The Temporal Segment Network (TSN) samples each video into several segments to model the long-range temporal structure of the video. The Temporal Relation Network (TRN) introduces an interpretable network to learn and infer temporal dependencies between video frames at multiple time scales. The Temporal Shift Module (TSM) shifts part of the channels along the time dimension to facilitate information exchange between adjacent frames. Although these supervised methods achieve good performance in modeling temporal dependencies, most of them only extract information from the RGB image modality of the video. With the development of the multi-modal field, researchers have begun to introduce multi-modal learning into video representation learning. Because of the dynamic nature and strict temporal ordering of video, learning its dynamic characteristics would undoubtedly improve the ability of a network to learn video features. Optical flow is the instantaneous velocity of the pixel motion of a moving object projected onto the observation imaging plane; it uses the temporal change of pixels in an image sequence and the correlation between adjacent frames to find the correspondence between the previous frame and the current frame, and thereby to calculate the motion information of objects between adjacent frames. Since optical flow captures the motion information of video well, most researchers use optical flow as a modality to improve the performance of video representation learning.
Although the RGB image contains the static information of the video and the optical flow contains its dynamic information, optical flow is itself generated from RGB images and is therefore not completely independent of the RGB image modality; moreover, existing 3D convolutional networks can already extract the dynamic information in the input image segment well. Thus, the use of the optical flow modality has reached a bottleneck. In video, besides rich picture information, there is also a lot of useful sound information. For example, the action of cutting down trees is usually accompanied by the sound of chopping, and the action of playing baseball by the sound of a bat hitting the ball; several studies have likewise demonstrated the effectiveness of audio. In previous related work, the network is trained by judging whether the audio and video are aligned and whether they belong to the same sample. Although such methods enable information interaction between modalities, they cannot solve the problem that intra-class sample differences are large while inter-class sample differences are small. Although these methods can learn good features, they share one disadvantage: the relevance of features between actions belonging to the same category is not taken into account.
The prior art discloses a bimodal emotion recognition method based on multi-modal deep learning, which obtains three-channel input matrices of audio and of video RGB images to form audio data samples and video data samples; constructs an audio deep convolutional neural network and a video deep convolutional neural network to obtain high-level audio features and high-level video features; establishes a fusion network composed of fully connected layers to construct a high-level unified audio-video feature; and aggregates the unified audio-video features output by the last fully connected layer of the fusion network into a global feature, which is input into a classifier to obtain the audio-video emotion recognition classification result. The fusion network composed of fully connected layers fuses the audio and video emotion information, constructs a high-level unified audio-video feature representation, and effectively improves audio-video emotion recognition performance. However, this prior art does not involve learning a video representation using audio and RGB pictures as input.
Disclosure of Invention
The invention provides a cross-modal feature fusion system based on an attention mechanism, which realizes the fusion of audio and video features and takes audio and RGB pictures as input to learn the video representation.
In order to achieve the technical effects, the technical scheme of the invention is as follows:
a cross-modal feature fusion system based on an attention mechanism, comprising:
the audio-video correlation analysis module is used for aligning the two modalities of audio and video RGB images;
the supervised contrastive learning module is used for extracting modality features from the two modalities of audio and video RGB images;
and the cross-modal feature fusion module is used for learning a global context representation by exploiting the correlated knowledge between the modalities.
Further, the audio-video correlation analysis module continuously acquires an RGB segment $v_i$ generated from 16 consecutive frames of RGB images of a video i as the input of the RGB image modality; at this time, only one segment is sampled from a video, and in order to make full use of the effective audio information in the video, the audio extracted from the entire video i is converted into the mel spectrogram $a_i$ of that video as the input of the audio modality, where i = 1, ···, N.
Further, the specific processing procedure of the supervised contrastive learning module is as follows:
1) modal feature extraction: the feature extracted from the RGB segment $v_i$ of video i by a 3D convolutional network with R3D as the backbone is denoted $f_i^v$; correspondingly, the feature extracted from the audio mel spectrogram $a_i$ by the audio 3D convolutional network is denoted $f_i^a$;
2) a self-supervised contrastive loss is generated through self-supervised contrastive learning;
3) a supervised contrastive loss is generated through supervised contrastive learning.
Further, the specific process of generating the self-supervised contrastive loss through self-supervised contrastive learning is as follows:
the positive pairs $\{f_i^v, f_i^a\}_{i=1,\cdots,N}$ are: the RGB segment feature $f_i^v$ from a video i and the mel-spectrogram feature $f_i^a$ generated from the corresponding audio; the negative pairs $\{f_i^v, f_j^v\}_{i\neq j} \cup \{f_i^v, f_j^a\}_{i\neq j}$ are: the RGB segment feature $f_i^v$ generated from video i paired with all RGB segment features $f_j^v$ and mel-spectrogram features $f_j^a$ generated from any other video j (i ≠ j); the self-supervised contrastive loss for the RGB image modality of the video is expressed as:

$$\mathcal{L}_{self}^{v}=-\sum_{i=1}^{N}\log\frac{\exp\left(f_i^v\cdot f_i^a/\tau\right)}{\sum_{j=1}^{N}\exp\left(f_i^v\cdot f_j^a/\tau\right)+\sum_{j\neq i}\exp\left(f_i^v\cdot f_j^v/\tau\right)}\qquad(1)$$

where $\tau$ is a scalar temperature parameter, the numerator measures the similarity of the positive pair, and the denominator sums the similarities over all positive and negative pairs;
similarly, the self-supervised contrastive loss for the audio modality is:

$$\mathcal{L}_{self}^{a}=-\sum_{i=1}^{N}\log\frac{\exp\left(f_i^a\cdot f_i^v/\tau\right)}{\sum_{j=1}^{N}\exp\left(f_i^a\cdot f_j^v/\tau\right)+\sum_{j\neq i}\exp\left(f_i^a\cdot f_j^a/\tau\right)}\qquad(2)$$

the overall self-supervised contrastive loss is given by equations (1) and (2):

$$\mathcal{L}_{self}=\mathcal{L}_{self}^{v}+\mathcal{L}_{self}^{a}\qquad(3)$$
further, the specific process of generating the supervised contrast loss by the supervised contrast learning is as follows:
directly facing,}i, j =1, ·, N, and yi = yj∪{,}i, j =1, ·, N with i ≠ j and yi = yjExpressed as:and all RGB segment features from video i and video j of the same classAnd Mel-frequency-map features generated by audio,(ii) a The rest are negative pairs; the supervised contrast loss formula is as follows:
the supervised contrast loss for the RGB image modality of video is expressed as:
wherein the content of the first and second substances,
is a scalar temperature parameter, where the numerator is the sum of all positive and negative sample distances and the denominator is the sum of all positive and negative sample distances;
similarly, the supervised contrast loss for audio modalities is:
the overall supervised contrast loss is given by equations (4) (5):
further, the cross-modal feature fusion module receives features from different modalities and learns global context embedding, which is then used to recalibrate input features from different segments, using video segment features learned from the supervised contrast learning framework as inputs, fused features as outputs, and computing the loss function of the fused portion by cross entropy.
Further, the specific processing procedure of the cross-modal feature fusion module is as follows:
the two modalities of a video i are $v_i$ and $a_i$, and the features extracted by the three-dimensional convolutional networks in the supervised contrastive learning framework are $f_i^v$ and $f_i^a$; to exploit the correlation between the two modalities, the two feature vectors are concatenated and a joint representation is obtained through a fully connected layer:

$$Z_u=W_s\left[f_i^v,f_i^a\right]+b_s\qquad(7)$$

where $[\cdot,\cdot]$ denotes the concatenation operation, $Z_u$ denotes the joint representation, and $W_s$ and $b_s$ are the weight and bias of the fully connected layer; the dimensionality of $Z_u$ is chosen to be smaller than that of the concatenated features in order to limit model capacity and increase generalization ability; in order to exploit the global context information aggregated in the joint representation $Z_u$, an excitation signal is predicted for each modality by a fully connected layer:

$$E_m=W_e^m Z_u+b_e^m,\quad m\in\{v,a\}\qquad(8)$$

where $W_e^m$ and $b_e^m$ are the weight and bias of the fully connected layer; after the excitation signal $E_m$ is obtained, it is used to adaptively recalibrate the input features through a simple gating mechanism:

$$\tilde{f}_i^v=\delta(E_v)\odot f_i^v,\qquad \tilde{f}_i^a=\delta(E_a)\odot f_i^a\qquad(9)$$

where $\odot$ is the channel-wise product of each element along the channel dimension and $\delta(\cdot)$ is the rectified linear function; in this way, the features of one segment are allowed to recalibrate the features of the other segment while preserving the correlation between the different segments;
the two refined feature vectors $\tilde{f}_i^v$ and $\tilde{f}_i^a$ are concatenated and input into a fully connected layer with the normalized exponential function soft-max as the classification output, and the cross-entropy loss is used to measure the correctness of the classification:

$$\mathcal{L}_{cross}=-\sum_{i=1}^{C}y_i\log p_i\qquad(10)$$

where $y_i$ and $p_i$ respectively denote the ground-truth and predicted probability that the sample belongs to class i, and C denotes the number of classes;
the overall loss function is obtained by combining equation (6) and equation (10), where $\lambda_{sup}$ and $\lambda_{cross}$ respectively control the contributions of $\mathcal{L}_{sup}$ and $\mathcal{L}_{cross}$:

$$\mathcal{L}=\lambda_{sup}\mathcal{L}_{sup}+\lambda_{cross}\mathcal{L}_{cross}\qquad(11)$$
further, each video segment is sized to have a size ofWhere c is the number of channels, l is the number of frames, and h and w represent the height and width of the frames.
Further, the size of the 3D convolution kernel is t × d × d, where t is the temporal length and d is the spatial size; the video RGB image sequence is defined as $V=\{v_i\}_{i=1}^{N}$, where $v_i$ is an RGB segment generated by consecutively sampling m frames from a video i (i = 1, ..., N).
Further, the audio modality is a mel spectrogram generated by applying a short-time Fourier transform to the entire audio of a video; one segment of the video RGB images and the mel spectrogram generated from the whole video are aligned as input; the audio mel-spectrogram sequence is denoted $A=\{a_i\}_{i=1}^{N}$, where $a_i$ is the mel spectrogram generated from the audio extracted from video i.
Compared with the prior art, the technical scheme of the invention has the beneficial effects that:
the invention provides a method for extracting characteristics of two modes of audio and video by using supervised contrast learning as a framework based on a complementary relation of information of the audio and video images, constructs an audio and video association analysis module to realize audio and video alignment, and designs a cross-mode characteristic fusion module based on an attention mechanism to realize the fusion of the audio and video characteristics. The audio and RGB pictures are used as input to achieve the goal of learning the video representation.
Drawings
FIG. 1 is a block diagram of the overall process of the system of the present invention;
FIG. 2 is an exemplary diagram of audio-video contrastive learning in the present invention;
FIG. 3 is a block diagram of the supervised contrastive learning (SCL) process of the present invention;
FIG. 4 is a framework diagram of the processing procedure of the cross-modal feature fusion module (MFAM) in the present invention.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent;
for the purpose of better illustrating the embodiments, certain features of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product;
it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
Example 1
As shown in fig. 1, a cross-modal feature fusion system based on attention mechanism includes:
the audio-video correlation analysis module is used for aligning the two modalities of audio and video RGB images;
the supervised contrastive learning module is used for extracting modality features from the two modalities of audio and video RGB images;
and the cross-modal feature fusion module is used for learning a global context representation by exploiting the correlated knowledge between the modalities.
The audio-video correlation analysis module continuously acquires an RGB segment $v_i$ generated from 16 consecutive frames of RGB images of a video i as the input of the RGB image modality; at this time, only one segment is sampled from a video, and in order to make full use of the effective audio information in the video, the audio extracted from the entire video i is converted into the mel spectrogram $a_i$ of that video as the input of the audio modality, where i = 1, ···, N.
The specific processing procedure of the supervised contrastive learning module is as follows:
1) modal feature extraction: the feature extracted from the RGB segment $v_i$ of video i by a 3D convolutional network with R3D as the backbone is denoted $f_i^v$; correspondingly, the feature extracted from the audio mel spectrogram $a_i$ by the audio 3D convolutional network is denoted $f_i^a$;
2) a self-supervised contrastive loss is generated through self-supervised contrastive learning;
3) a supervised contrastive loss is generated through supervised contrastive learning.
The two modalities of audio and video RGB images are aligned by the audio-video correlation analysis (AVCA) module. In this module, the video RGB image modality and the audio modality of each video are used as input. The video RGB image modality is a segment formed by randomly sampling 16 consecutive frames from the video. The audio modality is a mel spectrogram generated by applying a short-time Fourier transform to the entire audio of the video. One segment of the video RGB images is aligned with the mel spectrogram generated from the whole video as input.
In the supervised contrastive learning (SCL) module, spatio-temporal features are extracted from the video RGB image segments and the audio mel spectrograms using two different three-dimensional convolutional neural networks (3D CNNs), and the 3D CNNs within each modality share the same weights. A supervised contrastive loss is then designed on the features generated by the two modalities to enhance the discriminative power of representation learning for samples of the same class.
In multi-modal fusion, an attention-based cross-modal feature fusion module (MFAM) is introduced; the features learned from the supervised contrastive learning framework are propagated through the MFAM module and the channel features are adaptively recalibrated. The recalibrated features are connected and the loss function is calculated through cross entropy.
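For illustration only, the following minimal PyTorch-style sketch shows how the three modules described in this embodiment could be wired together in one training step; every name in it (rgb_encoder, audio_encoder, fusion_module, sup_contrast_loss, and the weighting defaults) is a hypothetical placeholder rather than the patented implementation, and concrete sketches of the individual pieces are given in the detailed description below.

```python
# Hypothetical end-to-end training step combining the three modules (assumed interfaces).
import torch

def training_step(rgb_clip, mel_spec, labels,
                  rgb_encoder, audio_encoder, fusion_module,
                  sup_contrast_loss, lambda_sup=1.0, lambda_cross=1.0):
    """rgb_clip: (N, 3, 16, H, W) RGB segments; mel_spec: (N, 1, ...) mel spectrograms."""
    f_v = rgb_encoder(rgb_clip)                       # RGB segment features
    f_a = audio_encoder(mel_spec)                     # audio mel-spectrogram features
    loss_sup = sup_contrast_loss(f_v, f_a, labels)    # supervised contrastive loss
    logits = fusion_module(f_v, f_a)                  # MFAM fusion + classification head
    loss_cross = torch.nn.functional.cross_entropy(logits, labels)
    return lambda_sup * loss_sup + lambda_cross * loss_cross
```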
Example 2
As shown in fig. 1, a cross-modal feature fusion system based on attention mechanism includes:
the audio-video correlation analysis module is used for aligning the two modalities of audio and video RGB images;
the supervised contrastive learning module is used for extracting modality features from the two modalities of audio and video RGB images;
and the cross-modal feature fusion module is used for learning a global context representation by exploiting the correlated knowledge between the modalities.
The audio-video correlation analysis module continuously acquires an RGB segment $v_i$ generated from 16 consecutive frames of RGB images of a video i as the input of the RGB image modality; at this time, only one segment is sampled from a video, and in order to make full use of the effective audio information in the video, the audio extracted from the entire video i is converted into the mel spectrogram $a_i$ of that video as the input of the audio modality, where i = 1, ···, N.
The specific processing procedure of the supervised contrastive learning module is as follows:
1) modal feature extraction: the feature extracted from the RGB segment $v_i$ of video i by a 3D convolutional network with R3D as the backbone is denoted $f_i^v$; correspondingly, the feature extracted from the audio mel spectrogram $a_i$ by the audio 3D convolutional network is denoted $f_i^a$;
2) a self-supervised contrastive loss is generated through self-supervised contrastive learning;
3) a supervised contrastive loss is generated through supervised contrastive learning.
The specific process of generating the self-supervised contrastive loss through self-supervised contrastive learning is as follows:
the positive pairs $\{f_i^v, f_i^a\}_{i=1,\cdots,N}$ are: the RGB segment feature $f_i^v$ from a video i and the mel-spectrogram feature $f_i^a$ generated from the corresponding audio; the negative pairs $\{f_i^v, f_j^v\}_{i\neq j} \cup \{f_i^v, f_j^a\}_{i\neq j}$ are: the RGB segment feature $f_i^v$ generated from video i paired with all RGB segment features $f_j^v$ and mel-spectrogram features $f_j^a$ generated from any other video j (i ≠ j); the self-supervised contrastive loss for the RGB image modality of the video is expressed as:

$$\mathcal{L}_{self}^{v}=-\sum_{i=1}^{N}\log\frac{\exp\left(f_i^v\cdot f_i^a/\tau\right)}{\sum_{j=1}^{N}\exp\left(f_i^v\cdot f_j^a/\tau\right)+\sum_{j\neq i}\exp\left(f_i^v\cdot f_j^v/\tau\right)}\qquad(1)$$

where $\tau$ is a scalar temperature parameter, the numerator measures the similarity of the positive pair, and the denominator sums the similarities over all positive and negative pairs;
similarly, the self-supervised contrastive loss for the audio modality is:

$$\mathcal{L}_{self}^{a}=-\sum_{i=1}^{N}\log\frac{\exp\left(f_i^a\cdot f_i^v/\tau\right)}{\sum_{j=1}^{N}\exp\left(f_i^a\cdot f_j^v/\tau\right)+\sum_{j\neq i}\exp\left(f_i^a\cdot f_j^a/\tau\right)}\qquad(2)$$

the overall self-supervised contrastive loss is given by equations (1) and (2):

$$\mathcal{L}_{self}=\mathcal{L}_{self}^{v}+\mathcal{L}_{self}^{a}\qquad(3)$$
the specific process of generating the supervised contrast loss by the supervised contrast learning is as follows:
directly facing,}i, j =1, ·, N, and yi = yj∪{,}i, j =1, ·, N with i ≠ j and yi = yjExpressed as:and all RGB segment features from video i and video j of the same classAnd Mel-frequency-map features generated by audio,(ii) a The rest are negative pairs; the supervised contrast loss formula is as follows:
the supervised contrast loss for the RGB image modality of video is expressed as:
wherein the content of the first and second substances,
is a scalar temperature parameter, where the numerator is the sum of all positive and negative sample distances and the denominator is the sum of all positive and negative sample distances;
similarly, the supervised contrast loss for audio modalities is:
the overall supervised contrast loss is given by equations (4) (5):
the cross-modal feature fusion module receives features from different modalities and learns global context embedding, then the embedding is used for recalibrating input features from different segments, video segment features learned from a supervised contrast learning framework are used as input, the fused features are used as output, and loss functions of a fusion part are calculated through cross entropy.
The specific processing procedure of the cross-modal feature fusion module is as follows:
the two modalities of a video i are $v_i$ and $a_i$, and the features extracted by the three-dimensional convolutional networks in the supervised contrastive learning framework are $f_i^v$ and $f_i^a$; to exploit the correlation between the two modalities, the two feature vectors are concatenated and a joint representation is obtained through a fully connected layer:

$$Z_u=W_s\left[f_i^v,f_i^a\right]+b_s\qquad(7)$$

where $[\cdot,\cdot]$ denotes the concatenation operation, $Z_u$ denotes the joint representation, and $W_s$ and $b_s$ are the weight and bias of the fully connected layer; the dimensionality of $Z_u$ is chosen to be smaller than that of the concatenated features in order to limit model capacity and increase generalization ability; in order to exploit the global context information aggregated in the joint representation $Z_u$, an excitation signal is predicted for each modality by a fully connected layer:

$$E_m=W_e^m Z_u+b_e^m,\quad m\in\{v,a\}\qquad(8)$$

where $W_e^m$ and $b_e^m$ are the weight and bias of the fully connected layer; after the excitation signal $E_m$ is obtained, it is used to adaptively recalibrate the input features through a simple gating mechanism:

$$\tilde{f}_i^v=\delta(E_v)\odot f_i^v,\qquad \tilde{f}_i^a=\delta(E_a)\odot f_i^a\qquad(9)$$

where $\odot$ is the channel-wise product of each element along the channel dimension and $\delta(\cdot)$ is the rectified linear function; in this way, the features of one segment are allowed to recalibrate the features of the other segment while preserving the correlation between the different segments;
the two refined feature vectors $\tilde{f}_i^v$ and $\tilde{f}_i^a$ are concatenated and input into a fully connected layer with the normalized exponential function soft-max as the classification output, and the cross-entropy loss is used to measure the correctness of the classification:

$$\mathcal{L}_{cross}=-\sum_{i=1}^{C}y_i\log p_i\qquad(10)$$

where $y_i$ and $p_i$ respectively denote the ground-truth and predicted probability that the sample belongs to class i, and C denotes the number of classes;
the overall loss function is obtained by combining equation (6) and equation (10), where $\lambda_{sup}$ and $\lambda_{cross}$ respectively control the contributions of $\mathcal{L}_{sup}$ and $\mathcal{L}_{cross}$:

$$\mathcal{L}=\lambda_{sup}\mathcal{L}_{sup}+\lambda_{cross}\mathcal{L}_{cross}\qquad(11)$$
example 3
To facilitate the description of each module, given N different videos, each video segment has size $c \times l \times h \times w$, where c is the number of channels, l is the number of frames, and h and w represent the height and width of the frames. The size of the 3D convolution kernel is t × d × d, where t is the temporal length and d is the spatial size. The video RGB image sequence is defined as $V=\{v_i\}_{i=1}^{N}$, where $v_i$ is an RGB segment generated by consecutively sampling m frames from a video i (i = 1, ..., N). The audio modality is a mel spectrogram generated by applying a short-time Fourier transform to the entire audio of a video; one segment of the video RGB images and the mel spectrogram generated from the whole video are aligned as input. The audio mel-spectrogram sequence is denoted $A=\{a_i\}_{i=1}^{N}$, where $a_i$ is the mel spectrogram generated from the audio extracted from video i. $y_i$ is the category label of video i.
1) Audio-video correlation analysis (audio-video alignment)
The sound signal is one-dimensional: only its time-domain information can be observed directly, not its frequency-domain information. The signal can be transformed to the frequency domain by the Fourier transform (FT), but the time-domain information is then lost and the time-frequency relationship cannot be seen. Many methods have been developed to solve this problem; the short-time Fourier transform and wavelets are common time-frequency analysis methods.
The short-time Fourier transform (STFT) is a Fourier transform applied to short-time signals. The principle is: a long speech signal is framed and windowed, a Fourier transform is applied to each frame, and the results of all frames are stacked along another dimension to obtain a two-dimensional image, the spectrogram.
Since the resulting spectrogram is large, it is usually passed through mel-scale filter banks to obtain a mel spectrogram of a suitable size as the sound feature.
In conventional audio-video alignment, an RGB image is mostly aligned with the mel spectrum generated from the audio of the corresponding time span. This method can align the two modalities and extract the static image information and audio information of the video, but it ignores the temporal information contained in the video itself.
In order to utilize the temporal information of the video, the invention continuously acquires an RGB segment $v_i$ generated from 16 consecutive frames of RGB images of a video i (i = 1, ..., N) as the input of the RGB image modality. At this time, only one segment is sampled from a video; in order to make full use of the effective audio information in the video, the audio extracted from the entire video i is converted into the mel spectrogram $a_i$ of that video as the input of the audio modality.
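As a minimal sketch of this preprocessing step (an assumption made for illustration, not the patented procedure), the full-video audio could be converted to a log-mel spectrogram with librosa; the sampling rate, FFT size, hop length, and number of mel bands below are illustrative choices that the invention does not specify.

```python
# Hypothetical preprocessing: whole-video audio -> mel spectrogram a_i (assumed parameters).
import librosa
import numpy as np

def video_audio_to_mel(audio_path, sr=16000, n_fft=1024, hop_length=512, n_mels=64):
    y, _ = librosa.load(audio_path, sr=sr)                          # waveform of the whole video
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel, ref=np.max)                  # log-mel spectrogram
    return log_mel[np.newaxis, ...]                                 # (1, n_mels, T) single-channel "image"
```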
1.1) Audio-video contrastive learning
In the image field, self-supervised contrastive learning is a common learning method. Its core idea is that features of data from the same sample should be as close as possible, while features of data from different samples should be as far apart as possible. In the image field, data are generally augmented by flipping or cropping a picture, so that the generated picture and the original picture form a positive pair, while all other pictures form negative pairs with the original picture. The contrastive loss pulls positive pairs close and pushes negative pairs apart.
In order to make the features of similar actions close to each other, researchers have proposed a new contrastive learning method: supervised contrastive learning. Its core idea is that features of data from the same category should be as close as possible, while features of data from different categories should be as far apart as possible. The positive pairs are then extended to the pictures generated by augmenting the original picture and the pictures having the same category label as the original picture, and the negative pairs are all pictures that do not belong to the same category as the original picture.
Although contrastive learning has been widely applied to image learning, and some researchers have introduced it into video representation learning, combining contrastive learning with the multi-modal field has only been proposed in recent years. In the multi-modal field, most studies only use RGB images and optical flow as the two modalities, and audio is used as a modality far less often. Therefore, the invention introduces supervised contrastive learning into audio-video multi-modal learning, so that the model can better extract the features of the different modalities and better distinguish samples with large intra-class differences and small inter-class differences.
2) Modal feature extraction
The feature extracted from the RGB segment $v_i$ of video i by a 3D convolutional network with R3D as the backbone is denoted $f_i^v$; correspondingly, the feature extracted from the audio mel spectrogram $a_i$ by the audio 3D convolutional network is denoted $f_i^a$. The difference between the two networks is the number of channels of the input image.
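As an assumed sketch of how such encoders might be built (not the patented implementation), both feature extractors could reuse torchvision's r3d_18 backbone, with the first convolution adapted to a single-channel mel-spectrogram input; the 512-dimensional feature size is the backbone's default and the channel adaptation is an illustrative choice.

```python
# Hypothetical R3D-based encoders for the RGB and audio modalities (assumed design).
import torch.nn as nn
from torchvision.models.video import r3d_18

def make_encoder(in_channels=3):
    net = r3d_18(weights=None)
    if in_channels != 3:  # e.g. a 1-channel mel spectrogram treated as a short single-channel clip
        net.stem[0] = nn.Conv3d(in_channels, 64, kernel_size=(3, 7, 7),
                                stride=(1, 2, 2), padding=(1, 3, 3), bias=False)
    net.fc = nn.Identity()      # expose the 512-d feature f_i instead of class logits
    return net

rgb_encoder = make_encoder(3)   # input: (N, 3, 16, H, W) RGB segments v_i
audio_encoder = make_encoder(1) # input: (N, 1, T', H', W') reshaped mel spectrogram a_i
```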
3) Supervised contrastive learning
3.1) Self-supervised contrastive learning
As shown in fig. 3, the supervised contrastive learning (SCL) framework first builds on self-supervised contrastive learning, whose core is to make the distance between data features from the same sample as small as possible and the distance between data features from different samples as large as possible.
In the invention, taking the RGB segment feature $f_i^v$ of a video i as an example, the positive pairs $\{f_i^v, f_i^a\}_{i=1,\cdots,N}$ are: the RGB segment feature $f_i^v$ from video i and the mel-spectrogram feature $f_i^a$ generated from the corresponding audio; the negative pairs $\{f_i^v, f_j^v\}_{i\neq j} \cup \{f_i^v, f_j^a\}_{i\neq j}$ are: the RGB segment feature $f_i^v$ generated from video i paired with all RGB segment features $f_j^v$ and mel-spectrogram features $f_j^a$ generated from any other video j (i ≠ j), as shown in fig. 2. At this time, the self-supervised contrastive loss for the RGB image modality of the video is expressed as:

$$\mathcal{L}_{self}^{v}=-\sum_{i=1}^{N}\log\frac{\exp\left(f_i^v\cdot f_i^a/\tau\right)}{\sum_{j=1}^{N}\exp\left(f_i^v\cdot f_j^a/\tau\right)+\sum_{j\neq i}\exp\left(f_i^v\cdot f_j^v/\tau\right)}\qquad(1)$$

where $\tau$ is a scalar temperature parameter, the numerator measures the similarity of the positive pair, and the denominator sums the similarities over all positive and negative pairs;
similarly, the self-supervised contrastive loss for the audio modality is:

$$\mathcal{L}_{self}^{a}=-\sum_{i=1}^{N}\log\frac{\exp\left(f_i^a\cdot f_i^v/\tau\right)}{\sum_{j=1}^{N}\exp\left(f_i^a\cdot f_j^v/\tau\right)+\sum_{j\neq i}\exp\left(f_i^a\cdot f_j^a/\tau\right)}\qquad(2)$$

the overall self-supervised contrastive loss is given by equations (1) and (2):

$$\mathcal{L}_{self}=\mathcal{L}_{self}^{v}+\mathcal{L}_{self}^{a}\qquad(3)$$
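Purely as an illustration of a loss of this form (mirroring equations (1)-(3) as reconstructed above, which are themselves a reconstruction rather than the original drawings), a cross-modal InfoNCE-style loss could be sketched in PyTorch as follows; the L2 normalization of features and the default temperature of 0.07 are assumptions.

```python
# Hypothetical self-supervised cross-modal contrastive loss (InfoNCE-style sketch).
import torch
import torch.nn.functional as F

def self_supervised_contrast_loss(f_v, f_a, tau=0.07):
    """f_v, f_a: (N, D) RGB-segment and mel-spectrogram features of the same N videos."""
    f_v, f_a = F.normalize(f_v, dim=1), F.normalize(f_a, dim=1)
    sim_va = f_v @ f_a.t() / tau          # anchor f_i^v vs all audio features
    sim_vv = f_v @ f_v.t() / tau          # anchor f_i^v vs other RGB features
    sim_aa = f_a @ f_a.t() / tau
    eye = torch.eye(f_v.size(0), dtype=torch.bool, device=f_v.device)

    def one_direction(sim_cross, sim_same):
        pos = sim_cross.diag()                               # positive pair: same video, other modality
        same = sim_same.masked_fill(eye, float('-inf'))      # exclude the anchor itself
        denom = torch.cat([sim_cross, same], dim=1)          # positive + all negative pairs
        return -(pos - torch.logsumexp(denom, dim=1)).mean()

    return one_direction(sim_va, sim_vv) + one_direction(sim_va.t(), sim_aa)
```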
3.2) Supervised contrastive learning
Although self-supervised contrastive learning can learn good features, it has a disadvantage: the relevance of features between actions belonging to the same category is not considered. In order to make the features of actions of the same class close to each other, a new contrastive learning method is adopted: supervised contrastive learning. Its core is to make the distance between features of data from samples of the same class as small as possible, and the distance between features of data from different classes as large as possible.
In the present invention, taking the RGB segment feature $f_i^v$ as an example, the positive pairs $\{f_i^v, f_j^a\}_{y_i=y_j} \cup \{f_i^v, f_j^v\}_{i\neq j,\,y_i=y_j}$ are: $f_i^v$ paired with all RGB segment features $f_j^v$ and mel-spectrogram features $f_j^a$ from videos i and j of the same class; the remaining pairs are negative pairs. The supervised contrastive loss for the RGB image modality of the video is expressed as:

$$\mathcal{L}_{sup}^{v}=-\sum_{i=1}^{N}\frac{1}{|P(i)|}\sum_{p\in P(i)}\log\frac{\exp\left(f_i^v\cdot p/\tau\right)}{\sum_{q\in A(i)}\exp\left(f_i^v\cdot q/\tau\right)}\qquad(4)$$

where $P(i)=\{f_j^v: i\neq j,\ y_i=y_j\}\cup\{f_j^a: y_i=y_j\}$ is the set of positive features for the anchor $f_i^v$, $A(i)$ is the set of all features other than the anchor, $\tau$ is a scalar temperature parameter, the numerator measures the similarity of a positive pair, and the denominator sums the similarities over all positive and negative pairs;
similarly, the supervised contrastive loss for the audio modality is:

$$\mathcal{L}_{sup}^{a}=-\sum_{i=1}^{N}\frac{1}{|P(i)|}\sum_{p\in P(i)}\log\frac{\exp\left(f_i^a\cdot p/\tau\right)}{\sum_{q\in A(i)}\exp\left(f_i^a\cdot q/\tau\right)}\qquad(5)$$

the overall supervised contrastive loss is given by equations (4) and (5):

$$\mathcal{L}_{sup}=\mathcal{L}_{sup}^{v}+\mathcal{L}_{sup}^{a}\qquad(6)$$
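Similarly, a sketch of a cross-modal supervised contrastive loss in the style of equations (4)-(6) as reconstructed above could look as follows; pooling both modalities into one bank of 2N features and the per-anchor averaging over positives are assumptions rather than details confirmed by the original drawings.

```python
# Hypothetical cross-modal supervised contrastive loss (SupCon-style sketch).
import torch
import torch.nn.functional as F

def supervised_contrast_loss(f_v, f_a, labels, tau=0.07):
    """f_v, f_a: (N, D) features; labels: (N,) class labels y_i."""
    feats = F.normalize(torch.cat([f_v, f_a], dim=0), dim=1)       # (2N, D): both modalities pooled
    lab = torch.cat([labels, labels], dim=0)
    sim = feats @ feats.t() / tau
    self_mask = torch.eye(feats.size(0), dtype=torch.bool, device=feats.device)
    pos_mask = (lab.unsqueeze(0) == lab.unsqueeze(1)) & ~self_mask # same-class pairs, anchor excluded

    sim = sim.masked_fill(self_mask, float('-inf'))                # drop anchor-with-itself terms
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)     # log-softmax over all other features
    mean_log_prob_pos = (log_prob * pos_mask).sum(1) / pos_mask.sum(1).clamp(min=1)
    return -mean_log_prob_pos.mean()
```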
4) multimodal fusion
As shown in fig. 4, in order to better fuse information between the different modalities, an attention-based cross-modal feature fusion module (MFAM) is proposed. Since features from different modalities are correlated, a cross-modal feature fusion module is constructed that receives features from the different modalities and learns a global context embedding, which is then used to recalibrate the input features from the different segments, as shown in fig. 4. The video segment features learned from the supervised contrastive learning framework are used as input, the fused features are used as output, and the loss function of the fusion part is calculated through cross entropy.
To fix notation, assume that the two modalities of a video i are $v_i$ and $a_i$, and that the features extracted by the three-dimensional convolutional networks in the supervised contrastive learning framework are $f_i^v$ and $f_i^a$. To exploit the correlation between the two modalities, the two feature vectors are concatenated and a joint representation is obtained through a fully connected layer:

$$Z_u=W_s\left[f_i^v,f_i^a\right]+b_s\qquad(7)$$

where $[\cdot,\cdot]$ denotes the concatenation operation, $Z_u$ denotes the joint representation, and $W_s$ and $b_s$ are the weight and bias of the fully connected layer. The dimensionality of $Z_u$ is chosen to be smaller than that of the concatenated features in order to limit model capacity and increase generalization ability. In order to exploit the global context information aggregated in the joint representation $Z_u$, an excitation signal is predicted for each modality by a fully connected layer:

$$E_m=W_e^m Z_u+b_e^m,\quad m\in\{v,a\}\qquad(8)$$

where $W_e^m$ and $b_e^m$ are the weight and bias of the fully connected layer. After the excitation signal $E_m$ is obtained, it is used to adaptively recalibrate the input features through a simple gating mechanism:

$$\tilde{f}_i^v=\delta(E_v)\odot f_i^v,\qquad \tilde{f}_i^a=\delta(E_a)\odot f_i^a\qquad(9)$$

where $\odot$ is the channel-wise product of each element along the channel dimension and $\delta(\cdot)$ is the rectified linear function; in this way, the features of one segment are allowed to recalibrate the features of the other segment while preserving the correlation between the different segments.
The two refined feature vectors $\tilde{f}_i^v$ and $\tilde{f}_i^a$ are concatenated and input into a fully connected layer with the normalized exponential function soft-max as the classification output, and the cross-entropy loss is used to measure the correctness of the classification:

$$\mathcal{L}_{cross}=-\sum_{i=1}^{C}y_i\log p_i\qquad(10)$$

where $y_i$ and $p_i$ respectively denote the ground-truth and predicted probability that the sample belongs to class i, and C denotes the number of classes.
The overall loss function is obtained by combining equation (6) and equation (10), where $\lambda_{sup}$ and $\lambda_{cross}$ respectively control the contributions of $\mathcal{L}_{sup}$ and $\mathcal{L}_{cross}$:

$$\mathcal{L}=\lambda_{sup}\mathcal{L}_{sup}+\lambda_{cross}\mathcal{L}_{cross}\qquad(11)$$
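A compact PyTorch sketch of a fusion module of this kind is given below; it follows the reconstructed equations (7)-(10), but the feature dimensions, the reduction ratio of the joint representation, the per-modality excitation heads, the ReLU gating, and the number of classes are all assumptions made for illustration, not the patented design.

```python
# Hypothetical attention-based cross-modal feature fusion module (MFAM-style sketch).
import torch
import torch.nn as nn

class MFAM(nn.Module):
    def __init__(self, dim_v=512, dim_a=512, reduction=4, num_classes=101):
        super().__init__()
        joint_dim = (dim_v + dim_a) // reduction               # reduced joint representation Z_u
        self.squeeze = nn.Linear(dim_v + dim_a, joint_dim)     # eq. (7): joint representation
        self.excite_v = nn.Linear(joint_dim, dim_v)            # eq. (8): one excitation head per modality
        self.excite_a = nn.Linear(joint_dim, dim_a)
        self.act = nn.ReLU(inplace=True)
        self.classifier = nn.Linear(dim_v + dim_a, num_classes)

    def forward(self, f_v, f_a):
        z_u = self.squeeze(torch.cat([f_v, f_a], dim=1))
        f_v_hat = self.act(self.excite_v(z_u)) * f_v           # eq. (9): gated recalibration
        f_a_hat = self.act(self.excite_a(z_u)) * f_a
        return self.classifier(torch.cat([f_v_hat, f_a_hat], dim=1))  # logits for soft-max / eq. (10)
```

Cross entropy over these logits gives $\mathcal{L}_{cross}$, which would then be weighted against the supervised contrastive loss as in equation (11).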
the same or similar reference numerals correspond to the same or similar parts;
the positional relationships depicted in the drawings are for illustrative purposes only and are not to be construed as limiting the present patent;
it should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the present invention, and are not intended to limit the embodiments of the present invention. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. And are neither required nor exhaustive of all embodiments. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the claims of the present invention.
Claims (10)
1. A cross-modal feature fusion system based on an attention mechanism, comprising:
the audio-video correlation analysis module is used for aligning the two modalities of audio and video RGB images;
the supervised contrastive learning module is used for extracting modality features from the two modalities of audio and video RGB images;
and the cross-modal feature fusion module is used for learning a global context representation by exploiting the correlated knowledge between the modalities.
2. The attention-based cross-modal feature fusion system of claim 1, wherein the audio-video correlation analysis module continuously acquires an RGB segment $v_i$ generated from 16 consecutive frames of RGB images of a video i as the input of the RGB image modality; at this time, only one segment is sampled from a video, and in order to make full use of the effective audio information in the video, the audio extracted from the entire video i is converted into the mel spectrogram $a_i$ of that video as the input of the audio modality; wherein i = 1, ···, N.
3. The attention-based cross-modal feature fusion system of claim 2, wherein the specific processing procedure of the supervised contrastive learning module is as follows:
1) modal feature extraction: the feature extracted from the RGB segment $v_i$ of video i by a 3D convolutional network with R3D as the backbone is denoted $f_i^v$; correspondingly, the feature extracted from the audio mel spectrogram $a_i$ by the audio 3D convolutional network is denoted $f_i^a$;
2) a self-supervised contrastive loss is generated through self-supervised contrastive learning;
3) a supervised contrastive loss is generated through supervised contrastive learning.
4. The attention-based cross-modal feature fusion system of claim 3, wherein the specific process of generating the self-supervised contrastive loss through self-supervised contrastive learning is as follows:
the positive pairs $\{f_i^v, f_i^a\}_{i=1,\cdots,N}$ are: the RGB segment feature $f_i^v$ from a video i and the mel-spectrogram feature $f_i^a$ generated from the corresponding audio; the negative pairs $\{f_i^v, f_j^v\}_{i\neq j} \cup \{f_i^v, f_j^a\}_{i\neq j}$ are: the RGB segment feature $f_i^v$ generated from video i paired with all RGB segment features $f_j^v$ and mel-spectrogram features $f_j^a$ generated from any other video j (i ≠ j); the self-supervised contrastive loss for the RGB image modality of the video is expressed as:

$$\mathcal{L}_{self}^{v}=-\sum_{i=1}^{N}\log\frac{\exp\left(f_i^v\cdot f_i^a/\tau\right)}{\sum_{j=1}^{N}\exp\left(f_i^v\cdot f_j^a/\tau\right)+\sum_{j\neq i}\exp\left(f_i^v\cdot f_j^v/\tau\right)}\qquad(1)$$

where $\tau$ is a scalar temperature parameter, the numerator measures the similarity of the positive pair, and the denominator sums the similarities over all positive and negative pairs;
similarly, the self-supervised contrastive loss for the audio modality is:

$$\mathcal{L}_{self}^{a}=-\sum_{i=1}^{N}\log\frac{\exp\left(f_i^a\cdot f_i^v/\tau\right)}{\sum_{j=1}^{N}\exp\left(f_i^a\cdot f_j^v/\tau\right)+\sum_{j\neq i}\exp\left(f_i^a\cdot f_j^a/\tau\right)}\qquad(2)$$

the overall self-supervised contrastive loss is given by equations (1) and (2):

$$\mathcal{L}_{self}=\mathcal{L}_{self}^{v}+\mathcal{L}_{self}^{a}\qquad(3)$$
5. The attention-based cross-modal feature fusion system of claim 4, wherein the specific process of generating the supervised contrastive loss through supervised contrastive learning is as follows:
the positive pairs $\{f_i^v, f_j^a\}_{y_i=y_j} \cup \{f_i^v, f_j^v\}_{i\neq j,\,y_i=y_j}$ are: $f_i^v$ paired with all RGB segment features $f_j^v$ and mel-spectrogram features $f_j^a$ from videos i and j of the same class; the remaining pairs are negative pairs; the supervised contrastive loss for the RGB image modality of the video is expressed as:

$$\mathcal{L}_{sup}^{v}=-\sum_{i=1}^{N}\frac{1}{|P(i)|}\sum_{p\in P(i)}\log\frac{\exp\left(f_i^v\cdot p/\tau\right)}{\sum_{q\in A(i)}\exp\left(f_i^v\cdot q/\tau\right)}\qquad(4)$$

where $P(i)=\{f_j^v: i\neq j,\ y_i=y_j\}\cup\{f_j^a: y_i=y_j\}$ is the set of positive features for the anchor $f_i^v$, $A(i)$ is the set of all features other than the anchor, $\tau$ is a scalar temperature parameter, the numerator measures the similarity of a positive pair, and the denominator sums the similarities over all positive and negative pairs;
similarly, the supervised contrastive loss for the audio modality is:

$$\mathcal{L}_{sup}^{a}=-\sum_{i=1}^{N}\frac{1}{|P(i)|}\sum_{p\in P(i)}\log\frac{\exp\left(f_i^a\cdot p/\tau\right)}{\sum_{q\in A(i)}\exp\left(f_i^a\cdot q/\tau\right)}\qquad(5)$$

the overall supervised contrastive loss is given by equations (4) and (5):

$$\mathcal{L}_{sup}=\mathcal{L}_{sup}^{v}+\mathcal{L}_{sup}^{a}\qquad(6)$$
6. The attention-based cross-modal feature fusion system of claim 5, wherein the cross-modal feature fusion module receives features from the different modalities and learns a global context embedding, which is then used to recalibrate the input features from the different segments; the video segment features learned from the supervised contrastive learning framework are used as input, the fused features are used as output, and the loss function of the fusion part is calculated through cross entropy.
7. The attention-based cross-modal feature fusion system of claim 6, wherein the specific processing procedure of the cross-modal feature fusion module is as follows:
the two modalities of a video i are $v_i$ and $a_i$, and the features extracted by the three-dimensional convolutional networks in the supervised contrastive learning framework are $f_i^v$ and $f_i^a$; to exploit the correlation between the two modalities, the two feature vectors are concatenated and a joint representation is obtained through a fully connected layer:

$$Z_u=W_s\left[f_i^v,f_i^a\right]+b_s\qquad(7)$$

where $[\cdot,\cdot]$ denotes the concatenation operation, $Z_u$ denotes the joint representation, and $W_s$ and $b_s$ are the weight and bias of the fully connected layer; the dimensionality of $Z_u$ is chosen to be smaller than that of the concatenated features in order to limit model capacity and increase generalization ability; in order to exploit the global context information aggregated in the joint representation $Z_u$, an excitation signal is predicted for each modality by a fully connected layer:

$$E_m=W_e^m Z_u+b_e^m,\quad m\in\{v,a\}\qquad(8)$$

where $W_e^m$ and $b_e^m$ are the weight and bias of the fully connected layer; after the excitation signal $E_m$ is obtained, it is used to adaptively recalibrate the input features through a simple gating mechanism:

$$\tilde{f}_i^v=\delta(E_v)\odot f_i^v,\qquad \tilde{f}_i^a=\delta(E_a)\odot f_i^a\qquad(9)$$

where $\odot$ is the channel-wise product of each element along the channel dimension and $\delta(\cdot)$ is the rectified linear function; in this way, the features of one segment are allowed to recalibrate the features of the other segment while preserving the correlation between the different segments;
the two refined feature vectors $\tilde{f}_i^v$ and $\tilde{f}_i^a$ are concatenated and input into a fully connected layer with the normalized exponential function soft-max as the classification output, and the cross-entropy loss is used to measure the correctness of the classification:

$$\mathcal{L}_{cross}=-\sum_{i=1}^{C}y_i\log p_i\qquad(10)$$

where $y_i$ and $p_i$ respectively denote the ground-truth and predicted probability that the sample belongs to class i, and C denotes the number of classes;
the overall loss function is obtained by combining equation (6) and equation (10), where $\lambda_{sup}$ and $\lambda_{cross}$ respectively control the contributions of $\mathcal{L}_{sup}$ and $\mathcal{L}_{cross}$:

$$\mathcal{L}=\lambda_{sup}\mathcal{L}_{sup}+\lambda_{cross}\mathcal{L}_{cross}\qquad(11)$$
8. The attention-based cross-modal feature fusion system of claim 7, wherein each video segment has size $c \times l \times h \times w$, where c is the number of channels, l is the number of frames, and h and w represent the height and width of the frames.
9. The attention-based cross-modal feature fusion system of claim 8, wherein the size of the 3D convolution kernel is t × d × d, where t is the temporal length and d is the spatial size; the video RGB image sequence is defined as $V=\{v_i\}_{i=1}^{N}$, where $v_i$ is an RGB segment generated by consecutively sampling m frames from a video i (i = 1, ..., N).
10. The attention-based cross-modal feature fusion system of claim 9, wherein the audio modality is a mel spectrogram generated by applying a short-time Fourier transform to the entire audio of a video; one segment of the video RGB images and the mel spectrogram generated from the whole video are aligned as input; the audio mel-spectrogram sequence is denoted $A=\{a_i\}_{i=1}^{N}$, where $a_i$ is the mel spectrogram generated from the audio extracted from video i.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210256553.8A CN114329036B (en) | 2022-03-16 | 2022-03-16 | Cross-modal characteristic fusion system based on attention mechanism |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210256553.8A CN114329036B (en) | 2022-03-16 | 2022-03-16 | Cross-modal characteristic fusion system based on attention mechanism |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114329036A true CN114329036A (en) | 2022-04-12 |
CN114329036B CN114329036B (en) | 2022-07-05 |
Family
ID=81033312
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210256553.8A Active CN114329036B (en) | 2022-03-16 | 2022-03-16 | Cross-modal characteristic fusion system based on attention mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114329036B (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115019397A (en) * | 2022-06-15 | 2022-09-06 | 北京大学深圳研究生院 | Comparison self-monitoring human behavior recognition method and system based on temporal-spatial information aggregation |
CN115100390A (en) * | 2022-08-24 | 2022-09-23 | 华东交通大学 | Image emotion prediction method combining contrast learning and self-supervision region positioning |
CN115116448A (en) * | 2022-08-29 | 2022-09-27 | 四川启睿克科技有限公司 | Voice extraction method, neural network model training method, device and storage medium |
CN115620110A (en) * | 2022-12-16 | 2023-01-17 | 华南理工大学 | Video event positioning and identifying method, device and storage medium |
CN116824495A (en) * | 2023-06-26 | 2023-09-29 | 华东交通大学 | Dangerous behavior identification method, system, storage medium and computer equipment |
CN117173394A (en) * | 2023-08-07 | 2023-12-05 | 山东大学 | Weak supervision salient object detection method and system for unmanned aerial vehicle video data |
WO2024087337A1 (en) * | 2022-10-24 | 2024-05-02 | 深圳先进技术研究院 | Method for directly synthesizing speech from tongue ultrasonic images |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112820320A (en) * | 2020-12-31 | 2021-05-18 | 中国科学技术大学 | Cross-modal attention consistency network self-supervision learning method |
US20210342646A1 (en) * | 2020-04-30 | 2021-11-04 | Arizona Board Of Regents On Behalf Of Arizona State University | Systems, methods, and apparatuses for training a deep model to learn contrastive representations embedded within part-whole semantics via a self-supervised learning framework |
CN114118200A (en) * | 2021-09-24 | 2022-03-01 | 杭州电子科技大学 | Multi-modal emotion classification method based on attention-guided bidirectional capsule network |
- 2022-03-16: CN application CN202210256553.8A granted as patent CN114329036B (active)
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20210342646A1 (en) * | 2020-04-30 | 2021-11-04 | Arizona Board Of Regents On Behalf Of Arizona State University | Systems, methods, and apparatuses for training a deep model to learn contrastive representations embedded within part-whole semantics via a self-supervised learning framework |
CN112820320A (en) * | 2020-12-31 | 2021-05-18 | 中国科学技术大学 | Cross-modal attention consistency network self-supervision learning method |
CN114118200A (en) * | 2021-09-24 | 2022-03-01 | 杭州电子科技大学 | Multi-modal emotion classification method based on attention-guided bidirectional capsule network |
Non-Patent Citations (1)
Title |
---|
Tan Huadong: "Research on Cross-modal Generation and Synchronization Discrimination for Audio-Visual Data", China Master's Theses Full-text Database (Information Science and Technology) *
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115019397A (en) * | 2022-06-15 | 2022-09-06 | 北京大学深圳研究生院 | Comparison self-monitoring human behavior recognition method and system based on temporal-spatial information aggregation |
CN115019397B (en) * | 2022-06-15 | 2024-04-19 | 北京大学深圳研究生院 | Method and system for identifying contrasting self-supervision human body behaviors based on time-space information aggregation |
CN115100390A (en) * | 2022-08-24 | 2022-09-23 | 华东交通大学 | Image emotion prediction method combining contrast learning and self-supervision region positioning |
CN115116448A (en) * | 2022-08-29 | 2022-09-27 | 四川启睿克科技有限公司 | Voice extraction method, neural network model training method, device and storage medium |
CN115116448B (en) * | 2022-08-29 | 2022-11-15 | 四川启睿克科技有限公司 | Voice extraction method, neural network model training method, device and storage medium |
WO2024087337A1 (en) * | 2022-10-24 | 2024-05-02 | 深圳先进技术研究院 | Method for directly synthesizing speech from tongue ultrasonic images |
CN115620110A (en) * | 2022-12-16 | 2023-01-17 | 华南理工大学 | Video event positioning and identifying method, device and storage medium |
CN115620110B (en) * | 2022-12-16 | 2023-03-21 | 华南理工大学 | Video event positioning and identifying method, device and storage medium |
CN116824495A (en) * | 2023-06-26 | 2023-09-29 | 华东交通大学 | Dangerous behavior identification method, system, storage medium and computer equipment |
CN117173394A (en) * | 2023-08-07 | 2023-12-05 | 山东大学 | Weak supervision salient object detection method and system for unmanned aerial vehicle video data |
CN117173394B (en) * | 2023-08-07 | 2024-04-02 | 山东大学 | Weak supervision salient object detection method and system for unmanned aerial vehicle video data |
Also Published As
Publication number | Publication date |
---|---|
CN114329036B (en) | 2022-07-05 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN114329036B (en) | Cross-modal characteristic fusion system based on attention mechanism | |
Liu et al. | Deep learning for generic object detection: A survey | |
CN108804453B (en) | Video and audio recognition method and device | |
Lee et al. | Multi-view automatic lip-reading using neural network | |
US20200004493A1 (en) | Electronic apparatus, document displaying method thereof and non-transitory computer readable recording medium | |
CN112818861A (en) | Emotion classification method and system based on multi-mode context semantic features | |
WO2020177673A1 (en) | Video sequence selection method, computer device and storage medium | |
JP2023546173A (en) | Facial recognition type person re-identification system | |
Zong et al. | Emotion recognition in the wild via sparse transductive transfer linear discriminant analysis | |
US20220415023A1 (en) | Model update method and related apparatus | |
CN113822125B (en) | Processing method and device of lip language recognition model, computer equipment and storage medium | |
CN114519809A (en) | Audio-visual video analysis device and method based on multi-scale semantic network | |
CN110991500A (en) | Small sample multi-classification method based on nested integrated depth support vector machine | |
Agbo-Ajala et al. | A lightweight convolutional neural network for real and apparent age estimation in unconstrained face images | |
CN115147641A (en) | Video classification method based on knowledge distillation and multi-mode fusion | |
Islam et al. | Representation for action recognition with motion vector termed as: SDQIO | |
Aliakbarian et al. | Deep action-and context-aware sequence learning for activity recognition and anticipation | |
Chen et al. | Dual-bottleneck feature pyramid network for multiscale object detection | |
Rastgoo et al. | Word separation in continuous sign language using isolated signs and post-processing | |
US20220086401A1 (en) | System and method for language-guided video analytics at the edge | |
Liu et al. | A multimodal approach for multiple-relation extraction in videos | |
Afrasiabi et al. | Spatial-temporal dual-actor CNN for human interaction prediction in video | |
de Souza et al. | Building semantic understanding beyond deep learning from sound and vision | |
CN116958852A (en) | Video and text matching method and device, electronic equipment and storage medium | |
CN115222047A (en) | Model training method, device, equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |