CN115393968A - Audio-visual event positioning method fusing self-supervised multi-modal features - Google Patents

Audio-visual event positioning method fusing self-supervised multi-modal features

Info

Publication number
CN115393968A
Authority
CN
China
Prior art keywords
visual
auditory
features
matrix
audio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211032147.XA
Other languages
Chinese (zh)
Inventor
冉粤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Institute of Microsystem and Information Technology of CAS
Original Assignee
Shanghai Institute of Microsystem and Information Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Institute of Microsystem and Information Technology of CAS filed Critical Shanghai Institute of Microsystem and Information Technology of CAS
Priority to CN202211032147.XA priority Critical patent/CN115393968A/en
Publication of CN115393968A publication Critical patent/CN115393968A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/70 Multimodal biometrics, e.g. combining information from different biometric modalities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to an audio-visual event positioning method that fuses self-supervised multi-modal features. The method comprises: acquiring target video data and preprocessing it to obtain an image signal and a sound signal; and inputting the image signal and the sound signal into an audio-visual event positioning model for recognition and localization, obtaining the event category at each moment in the target video data. The audio-visual event positioning model comprises a visual-auditory feature extraction module, an audio-visual fusion module and a classification module connected in sequence, wherein the visual-auditory feature extraction module and the audio-visual fusion module are independent of each other. The invention improves the recognition accuracy of audio-visual events.

Description

Audio-visual event positioning method fusing self-supervised multi-modal features
Technical Field
The invention relates to the technical field of audio-visual event positioning, and in particular to an audio-visual event positioning method fusing self-supervised multi-modal features.
Background
Humans perceive the surrounding environment through a variety of senses (e.g., vision, hearing, touch and smell). Although machine learning has made dramatic advances in single-modality tasks such as image classification (vision), speech recognition (hearing) and natural language processing (text), many tasks that require the coordinated use of data from multiple modalities are being proposed so that machines can better mimic the way humans perceive the real world. These multi-modal tasks are closer to practical application scenarios, and research on them is increasing.
Audio-visual event localization based on both visual and auditory signals is an important application in the field of multi-modal video understanding and analysis. For a given video, the algorithm needs to determine when an audio-visual event of interest occurs and what type of event it is. Because an audio-visual event has both visual and auditory attributes, the information in the image data and the audio data of the video can be fully exploited to localize the event more efficiently and accurately. On the one hand, although multi-modal data provide more useful information about the video content than a single modality, the natural difference in how visual and auditory signals are composed makes it difficult to fuse information from different modalities, and the multi-modal input also introduces more noise. On the other hand, when no audio-visual event of interest occurs, the algorithm must still process the incoming signals and correctly judge them as background. Such background segments usually lack the distinctive visual and auditory characteristics of a specific event, which makes them harder to identify. To address these problems, existing methods usually integrate visual and auditory feature extraction and the fusion of the two modalities into a single end-to-end model, and mainly focus on designing specific attention modules to model the cross-modal relationship during fusion. However, existing algorithms have the following main limitations:
(1) Information exchange between different modalities during the feature extraction stage is often neglected, which prevents sufficient information fusion later.
(2) During information fusion, only the modeling of cross-modal relationships under certain special conditions is considered, so the algorithms lack generality overall.
(3) Additional modules or manual supervision are relied upon to identify the background when no audio-visual event of interest occurs; these extra modules and manual supervision greatly reduce the robustness of the algorithms in practical applications.
Disclosure of Invention
The technical problem the invention aims to solve is to provide an audio-visual event positioning method fusing self-supervised multi-modal features that can improve the recognition accuracy of audio-visual events.
The technical scheme adopted by the invention to solve the above technical problem is as follows: an audio-visual event positioning method fusing self-supervised multi-modal features, comprising the following steps:
acquiring target video data, and preprocessing the target video data to obtain an image signal and a sound signal;
inputting the image signal and the sound signal into an audio-visual event positioning model for identification and positioning to obtain the event category of each moment in the target video data;
the audiovisual event positioning model comprises a visual-auditory feature extraction module, an audiovisual fusion module and a classification module which are sequentially connected; the visual-auditory feature extraction module and the audio-visual fusion module are mutually independent; the visual-auditory feature extraction module is used for respectively extracting space-time features of the image signal and the sound signal by using CNN and Bi-LSTM to obtain visual features and auditory features; the audio-visual fusion module calculates the similarity between asynchronous visual features and auditory features based on cosine distance, and corrects the similarity of feature pairs according to the law of correlation decay in time and then fuses the features; and the classification module classifies based on the fused visual features and auditory features to obtain the event category of each moment in the target video data.
The preprocessing of the target video data specifically comprises:
dividing the acquired target video data into a plurality of equal-length segments, wherein each segment comprises synchronous image data and sound data;
randomly extracting one frame from each segment of image data and applying random cropping and Gaussian blur to it to obtain an image frame signal; converting each segment of sound data into a log-mel spectrum to obtain a sound spectrum signal;
arranging all the image frame signals and sound spectrum signals in chronological order to obtain the image signal and the sound signal.
The visual-auditory feature extraction module comprises a visual extraction part, an auditory extraction part, a visual projection layer, an auditory projection layer and a cross-correlation matrix unit. The visual extraction part and the auditory extraction part have the same structure, each comprising a CNN and a Bi-LSTM connected in sequence. The input of the visual extraction part is the image signal, from which the visual features are extracted; the input of the auditory extraction part is the sound signal, from which the auditory features are extracted. The visual projection layer maps the visual features to a higher-dimensional semantic space to obtain high-dimensional visual features, and the auditory projection layer maps the auditory features to a higher-dimensional semantic space to obtain high-dimensional auditory features. The cross-correlation matrix unit computes the cross-correlation matrix between the high-dimensional visual features and the high-dimensional auditory features along the time dimension.
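The following is a minimal sketch of such a visual-auditory feature extraction module. The ResNet-50 visual backbone and the 2048/4096/4096 projection layers follow the embodiment described later; the small audio CNN merely stands in for a VGGish-style backbone (which is not bundled with torchvision), and the Bi-LSTM hidden size is an assumption.

import torch
import torch.nn as nn
import torchvision

class AVFeatureExtractor(nn.Module):
    """Sketch of the visual-auditory feature extraction module (some dimensions assumed)."""
    def __init__(self, feat_dim=256):
        super().__init__()
        resnet = torchvision.models.resnet50(weights="IMAGENET1K_V1")
        self.v_cnn = nn.Sequential(*list(resnet.children())[:-1])   # visual CNN -> (B*T, 2048, 1, 1)
        self.a_cnn = nn.Sequential(                                  # stand-in for a VGGish-style audio CNN
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 128))
        self.v_lstm = nn.LSTM(2048, feat_dim, batch_first=True, bidirectional=True)
        self.a_lstm = nn.LSTM(128, feat_dim, batch_first=True, bidirectional=True)
        def proj(din):  # 3 linear projection layers, dims 2048 / 4096 / 4096 as in the embodiment
            return nn.Sequential(nn.Linear(din, 2048), nn.ReLU(),
                                 nn.Linear(2048, 4096), nn.ReLU(),
                                 nn.Linear(4096, 4096))
        self.v_proj, self.a_proj = proj(2 * feat_dim), proj(2 * feat_dim)

    def forward(self, images, spectra):
        # images: (B, T, 3, H, W); spectra: (B, T, 1, F, M)
        B, T = images.shape[:2]
        v = self.v_cnn(images.flatten(0, 1)).flatten(1).view(B, T, -1)
        a = self.a_cnn(spectra.flatten(0, 1)).view(B, T, -1)
        f_v, _ = self.v_lstm(v)                          # visual features f_v: (B, T, 2*feat_dim)
        f_a, _ = self.a_lstm(a)                          # auditory features f_a
        z_v, z_a = self.v_proj(f_v), self.a_proj(f_a)    # high-dimensional features for the self-supervised loss
        return f_v, f_a, z_v, z_a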
The loss function of the visual-auditory feature extraction module is:

L_{AVBT} = \sum_i (1 - C_{ii})^2 + \lambda \sum_i \sum_{j \neq i} C_{ij}^2

where C_{ij} is the element in the i-th row and j-th column of the cross-correlation matrix, and λ is a hyper-parameter that balances the importance of the diagonal elements against the off-diagonal elements.
The audio-visual fusion module comprises a visual projection matrix part, an auditory projection matrix part, a cross-modal affinity matrix part, a visual update part, an auditory update part and an audio-visual fusion part. The visual projection matrix part performs a linear projection with learnable parameters on the visual features, and the auditory projection matrix part performs a linear projection with learnable parameters on the auditory features; the cross-modal affinity matrix part computes the affinity matrices of the projected visual and auditory features along the time dimension and corrects the affinity matrices with a weighting matrix; the visual update part updates the visual features based on the corrected affinity matrix, and the auditory update part updates the auditory features based on the corrected affinity matrix; the audio-visual fusion part fuses the updated visual features and auditory features.
The cross-modal affinity matrix part obtains the affinity matrices of the projected visual and auditory features through

M_{av} = \frac{(W_a^1 f_a)(W_v^1 f_v)^T}{\sqrt{d}}, \quad M_{va} = M_{av}^T

and weights the affinity matrices element by element through

M'_{av} = \mathrm{softmax}(W_{av} \odot \mathrm{relu}(M_{av})), \quad M'_{va} = \mathrm{softmax}(W_{va} \odot \mathrm{relu}(M_{va}))

where M_{av} and M_{va} are the affinity matrices, W_a^1 is an auditory projection matrix of learnable parameters, W_v^1 is a visual projection matrix of learnable parameters, d is the dimension of the projected feature vectors, f_a is the auditory feature, f_v is the visual feature, M'_{av} and M'_{va} are the corrected affinity matrices, W_{av} and W_{va} are the weighting matrices corresponding to M_{av} and M_{va} respectively, ⊙ denotes the Hadamard product, softmax(·) denotes the softmax function, and relu(·) denotes the relu function.
The audio-visual fusion part completes the fusion of information between the modalities through

f'_a = W_a^3 (M'_{av} (W_a^2 f_a) + f_v)
f'_v = W_v^3 (M'_{va} (W_v^2 f_v) + f_a)

where W_a^2 and W_a^3 are auditory projection matrices of learnable parameters, and W_v^2 and W_v^3 are visual projection matrices of learnable parameters.
Advantageous effects
Owing to the adoption of the above technical scheme, compared with the prior art, the invention has the following advantages and positive effects: during visual-auditory feature extraction, the invention realizes a preliminary exchange of information between the modalities and reduces the heterogeneity gap between visual and auditory information; during audio-visual information fusion, the dependence on additional manual supervision of the background is reduced, and cross-modal information is further fused, on the basis of the extracted features, by aggregating visual and auditory features that carry similar semantic information. Through these technical means, the recognition accuracy of audio-visual events is improved.
Drawings
FIG. 1 is a flow chart of an embodiment of the present invention;
FIG. 2 is a schematic diagram of an audiovisual event localization model in an embodiment of the present invention;
FIG. 3 is a schematic diagram of a visual-auditory feature extraction module in an embodiment of the invention;
FIG. 4 is a schematic diagram of an audiovisual fusion module in an embodiment of the invention.
Detailed Description
The invention will be further illustrated with reference to the following specific examples. It should be understood that these examples are for illustrative purposes only and are not intended to limit the scope of the present invention. Further, it should be understood that various changes or modifications of the present invention may be made by those skilled in the art after reading the teaching of the present invention, and such equivalents may fall within the scope of the present invention as defined in the appended claims.
The embodiment of the invention relates to an audio-visual event positioning method fusing self-supervised multi-modal features which, as shown in FIG. 1, comprises the following steps: acquiring target video data and preprocessing it to obtain an image signal and a sound signal; and inputting the image signal and the sound signal into an audio-visual event positioning model for recognition and localization to obtain the event category at each moment in the target video data.
The audio-visual event positioning model comprises a visual-auditory feature extraction module, an audio-visual fusion module and a classification module connected in sequence; the visual-auditory feature extraction module and the audio-visual fusion module are independent of each other. The visual-auditory feature extraction module uses a CNN and a Bi-LSTM to extract spatio-temporal features from the image signal and the sound signal respectively, obtaining visual features and auditory features. The audio-visual fusion module calculates the similarity between asynchronous visual and auditory features based on the cosine distance, corrects the similarity of each feature pair according to the law that correlation decays over time, and then fuses the features. The classification module classifies the fused visual and auditory features to obtain the event category at each moment in the target video data.
During visual-auditory feature extraction, this embodiment realizes a preliminary exchange of information between the modalities and reduces the heterogeneity gap between visual and auditory information; during audio-visual information fusion, the dependence on additional manual supervision of the background is reduced, and cross-modal information is further fused, on the basis of the extracted features, by aggregating visual and auditory features with similar semantics. As shown in FIG. 2, the audio-visual event localization model in this embodiment is no longer end-to-end as a whole but is divided into two stages, feature extraction and information fusion, which can promote each other and greatly improve the effectiveness of event localization.
In this embodiment, the visual-auditory feature extraction module comprises a visual extraction part, an auditory extraction part, a visual projection layer, an auditory projection layer and a cross-correlation matrix unit. The visual extraction part and the auditory extraction part have the same structure, each comprising a CNN and a Bi-LSTM connected in sequence. The input of the visual extraction part is the image signal, from which the visual features are extracted; the input of the auditory extraction part is the sound signal, from which the auditory features are extracted. The visual projection layer maps the visual features to a higher-dimensional semantic space to obtain high-dimensional visual features, and the auditory projection layer maps the auditory features to a higher-dimensional semantic space to obtain high-dimensional auditory features. The cross-correlation matrix unit computes the cross-correlation matrix between the high-dimensional visual features and the high-dimensional auditory features along the time dimension.
In the first stage, the feature extraction module is trained in a self-supervised manner, as shown in FIG. 3. First, an input video S is divided into T equal-length segments, and each segment S_t contains synchronized image data V_t and sound data A_t. In the preprocessing stage of each training cycle, one frame is randomly extracted from each image segment V_t and subjected to random cropping and Gaussian blur to obtain a frame signal frame_t; each sound segment A_t is converted into a log-mel spectrum to obtain a sound spectrum signal spec_t. All frame signals frame_t and sound spectrum signals spec_t are arranged in chronological order to obtain the preprocessed results, i.e. the image signal frame and the sound signal spec, which are then fed into the feature extraction module. In the feature extraction module, a CNN and a Bi-LSTM are used in sequence to extract the spatio-temporal features of the visual and auditory signals, yielding the visual features f_v and the auditory features f_a. The visual features f_v and auditory features f_a are then mapped to a higher-dimensional semantic space through two separate projection layers, yielding the high-dimensional visual features z_v and the high-dimensional auditory features z_a.
Since synchronized visual and auditory signals are produced simultaneously by the same event, they can be regarded as two different expressions of that event and are naturally highly correlated. To enable the feature extraction module to learn high-quality joint audio-visual features, this embodiment proposes a new self-supervised loss function. To compute it, the cross-correlation matrix C between the high-dimensional visual features z_v and the high-dimensional auditory features z_a is first computed along the time dimension; the element C_{ij} in the i-th row and j-th column of C is defined as

C_{ij} = \frac{\sum_t z_a^{t,i} \, z_v^{t,j}}{\sqrt{\sum_t (z_a^{t,i})^2} \; \sqrt{\sum_t (z_v^{t,j})^2}}

where t indexes the video-audio segments, and i and j index the element positions in the high-dimensional auditory feature z_a and the high-dimensional visual feature z_v, respectively. The difference between the cross-correlation matrix and the identity matrix (of the same dimension) is then reduced, with the loss function L_{AVBT} given by

L_{AVBT} = \sum_i (1 - C_{ii})^2 + \lambda \sum_i \sum_{j \neq i} C_{ij}^2

where λ is a hyper-parameter that balances the importance of the diagonal elements against the off-diagonal elements. By minimizing L_{AVBT} during training, the feature extraction module learns to extract high-quality features: the heterogeneity gap between the different modalities is reduced by raising the semantic similarity of the audio-visual features, and information is preliminarily exchanged between the synchronized visual and auditory signals.
It should be noted that, after the first-stage training of the feature extraction module is finished, the parameters of the feature extraction module are frozen and used to extract fixed visual features f_v and auditory features f_a for the videos in the dataset, which are used for the subsequent cross-modal information fusion.
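In PyTorch, this freezing step can be done, for example, as follows (extractor refers to the feature extraction module sketched above; the variable names are illustrative):

# Freeze the trained extractor before the second stage.
extractor.eval()
for p in extractor.parameters():
    p.requires_grad_(False)

with torch.no_grad():
    f_v, f_a, _, _ = extractor(images, spectra)   # fixed features for the fusion stage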
In this embodiment, the audio-visual fusion module comprises a visual projection matrix part, an auditory projection matrix part, a cross-modal affinity matrix part, a visual update part, an auditory update part and an audio-visual fusion part. The visual projection matrix part performs a linear projection with learnable parameters on the visual features, and the auditory projection matrix part performs a linear projection with learnable parameters on the auditory features; the cross-modal affinity matrix part computes the affinity matrices of the projected visual and auditory features along the time dimension and corrects the affinity matrices with a weighting matrix; the visual update part updates the visual features based on the corrected affinity matrix, and the auditory update part updates the auditory features based on the corrected affinity matrix; the audio-visual fusion part fuses the updated visual and auditory features.
In the second stage, the audio-visual fusion module is trained on the basis of the visual and auditory features obtained in the first stage, under the supervision of the training-sample labels, to further fuse cross-modal information, as shown in FIG. 4. The main purpose of this stage is to compute the similarity between asynchronous visual and auditory features based on the cosine distance and, following the general law that correlation decays over time (i.e., the further apart visual and auditory signals are in time, the weaker their correlation), to correct the similarity of each feature pair and then aggregate features with high similarity. Specifically, the visual features f_v and auditory features f_a extracted by the feature extraction module are taken as input, and their affinity matrices along the time dimension are computed after linear projection as follows:
M_{av} = \frac{(W_a^1 f_a)(W_v^1 f_v)^T}{\sqrt{d}}, \quad M_{va} = M_{av}^T

where W_a^1 is an auditory projection matrix of learnable parameters, W_v^1 is a visual projection matrix of learnable parameters, and d is the dimension of the projected feature vectors. M_{av} and M_{va} are then weighted element by element with a weight based on the time difference between the audio-visual features, the formula being:
M'_{av} = \mathrm{softmax}(W_{av} \odot \mathrm{relu}(M_{av})), \quad M'_{va} = \mathrm{softmax}(W_{va} \odot \mathrm{relu}(M_{va}))

where relu(·) is the relu function, used to filter out negatively correlated audio-visual feature pairs; softmax(·) is the softmax function, used to normalize each row of the matrix; ⊙ denotes the Hadamard product, which applies the weight to each pair of audio-visual features; and W_{av} and W_{va} are the weighting matrices corresponding to the affinity matrices M_{av} and M_{va}, respectively. Taking W_{av} as an example, its element at position (i', j'), where i' indexes the auditory feature at the i'-th moment and j' indexes the visual feature at the j'-th moment (so that the absolute difference |i' - j'| is the time difference between the two features, and (i', j') is also the position of their correlation in M_{av}), is computed with the exponential function exp(·) so that it decreases as the time difference grows; θ is a settable hyper-parameter that balances the influence of the time difference between feature pairs against their similarity in the weighting operation and directly affects the final performance of the audio-visual fusion module. The corrected affinity matrices are then used to update the features of the visual and auditory modalities respectively, and the features of the different modalities are added to complete the fusion of information between modalities, computed as follows:
f'_a = W_a^3 (M'_{av} (W_a^2 f_a) + f_v)
f'_v = W_v^3 (M'_{va} (W_v^2 f_v) + f_a)

where W_a^2 and W_a^3 are auditory projection matrices of learnable parameters, and W_v^2 and W_v^3 are visual projection matrices of learnable parameters.
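Putting the reconstructed formulas together, a sketch of the audio-visual fusion module could look as follows. The exponential form of the temporal weighting matrix W_{av} is an assumption (the original discloses it only as an equation image), θ = 0.03 follows the embodiment, and all layer names are illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AVFusion(nn.Module):
    """Sketch of the audio-visual fusion module (temporal weighting form assumed)."""
    def __init__(self, dim, proj_dim, theta=0.03):
        super().__init__()
        self.Wa1, self.Wv1 = nn.Linear(dim, proj_dim, bias=False), nn.Linear(dim, proj_dim, bias=False)
        self.Wa2, self.Wv2 = nn.Linear(dim, dim, bias=False), nn.Linear(dim, dim, bias=False)
        self.Wa3, self.Wv3 = nn.Linear(dim, dim, bias=False), nn.Linear(dim, dim, bias=False)
        self.theta, self.d = theta, proj_dim

    def time_weight(self, T, device):
        # Assumed weighting: correlation decays exponentially with the time gap |i' - j'|.
        idx = torch.arange(T, device=device, dtype=torch.float32)
        return torch.exp(-self.theta * (idx[:, None] - idx[None, :]).abs())

    def forward(self, f_a, f_v):
        # f_a, f_v: (B, T, dim) auditory / visual features from the frozen extractor
        B, T, _ = f_a.shape
        M_av = (self.Wa1(f_a) @ self.Wv1(f_v).transpose(1, 2)) / self.d ** 0.5   # affinity matrix
        M_va = M_av.transpose(1, 2)
        W = self.time_weight(T, f_a.device)
        M_av = F.softmax(W * F.relu(M_av), dim=-1)       # filter negative pairs, weight, row-normalize
        M_va = F.softmax(W * F.relu(M_va), dim=-1)
        f_a_new = self.Wa3(M_av @ self.Wa2(f_a) + f_v)   # f'_a = W_a^3(M'_av W_a^2 f_a + f_v)
        f_v_new = self.Wv3(M_va @ self.Wv2(f_v) + f_a)   # f'_v = W_v^3(M'_va W_v^2 f_v + f_a)
        return f_a_new, f_v_new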
After the cross-modal information fusion is completed, the fused visual features and fused auditory features are added together and fed into the classification module. In this embodiment, the classification module is a 3-layer perceptron, and the final classification result is computed as

p = \mathrm{MLP}(\mathrm{layernorm}(f'_a) + \mathrm{layernorm}(f'_v))

where layernorm(·) is the layer normalization function. For a video divided into T segments, the output prediction p has dimension R^{T×C}, i.e. it indicates for each segment which type of event it belongs to, or whether it is background. When training the fusion module and the classification layer, the loss function is the cross-entropy between the output prediction p and the ground-truth video label gt.
To verify the effectiveness of this embodiment, a model was built on the PyTorch platform and experiments were conducted on the AVE dataset.
Experimental data: the AVE dataset contains 4143 videos covering 28 categories of audio-visual events, including musical instrument performance, human behavior, vehicle activity and animal activity. The dataset is divided by default into three subsets: the training, validation and test sets contain 3339, 402 and 402 videos, respectively. Each video is 10 seconds long; in the experiments, the pictures and sound of each video are divided into 10 consecutive equal-length 1 s segments for training the model. In the training phase (the first and second stages in FIG. 2), only the training set is used to train the model parameters, while the validation set is used to monitor overfitting when training the fusion module; in the verification phase, the model makes inferences on the test set.
Model evaluation method: when localizing audio-visual events on the AVE dataset, the prediction accuracy over all segments of all test videos is used as the metric, following the evaluation of previous algorithms. Because the model in this embodiment is no longer end-to-end, when the feature extraction module is trained in the self-supervised manner in the first stage, the quality of the extracted features is measured by classifying them directly with the classification layer without passing through the second-stage audio-visual fusion module (i.e., the features are used for the audio-visual event localization task without further cross-modal information fusion), and the prediction accuracy is used to measure the quality of the self-supervised features. The audio-visual fusion module in the second stage takes the self-supervised features learned in the first stage as input and is supervised directly with the ground-truth labels of the dataset during training, so the classification predictions obtained after fusing the self-supervised features are more accurate, and the comparison with existing algorithms is also fairer.
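The segment-level accuracy used as the metric can be computed, for example, as follows; logits and labels are assumed to be torch tensors of shape (N, T, C) and (N, T).

def segment_accuracy(logits, labels):
    """Fraction of 1 s segments whose predicted event/background class matches the label."""
    pred = logits.argmax(-1)                      # (N, T) predicted class per segment
    return (pred == labels).float().mean().item()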
Model details: the preprocessing part extracts 160 RGB frames with a resolution of 256x256 at a frame rate of 16 for each 10 s video; a fixed-length 10 s sound signal is extracted at a sampling rate of 16 kHz and divided into 10 segments of 1 s length, a short-time Fourier transform is applied to each segment, and a log-mel spectrum with dimension 96x64 is obtained through a mel filter bank. In the feature extraction module, the CNN that processes the visual signal is a ResNet-50 pre-trained on ImageNet; the CNN that processes the log-mel spectrum is a VGGish pre-trained on AudioSet; the projection layers are 3 linear projection layers whose dimensions are 2048, 4096 and 4096 from input to output. In the audio-visual fusion module, θ in the weighting matrix is set to 0.03.
Model training: the parameter λ of the loss function L_{AVBT} used when training the feature extraction module in the self-supervised manner is set to 5e-3; the module is trained with the LARS optimizer for 100 epochs (the first 10 epochs are used for learning-rate warm-up, and the learning rate is cosine-annealed over the remaining 90), with a batch size of 128 and a base learning rate of 2e-4. When training the audio-visual fusion module, the initial learning rate is 1e-4 and is multiplied by 0.98 every 10 epochs, the batch size is 128, and the Adam optimizer is used.
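A sketch of the corresponding optimizer and schedule setup is given below. Since LARS is not part of core PyTorch, plain SGD stands in for stage one (a third-party LARS implementation can be substituted); extractor, fusion and classifier refer to the illustrative modules sketched above.

import torch
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR, StepLR

# Stage one (self-supervised extractor): 100 epochs, 10 warm-up + 90 cosine-annealed, base lr 2e-4.
opt1 = torch.optim.SGD(extractor.parameters(), lr=2e-4, momentum=0.9)
sched1 = SequentialLR(
    opt1,
    schedulers=[LinearLR(opt1, start_factor=0.1, total_iters=10),
                CosineAnnealingLR(opt1, T_max=90)],
    milestones=[10])

# Stage two (fusion + classifier): Adam, initial lr 1e-4, multiplied by 0.98 every 10 epochs.
params2 = list(fusion.parameters()) + list(classifier.parameters())
opt2 = torch.optim.Adam(params2, lr=1e-4)
sched2 = StepLR(opt2, step_size=10, gamma=0.98)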
Experimental results: for the self-supervised feature extraction part, the method is mainly compared with AVEL, the most popular multi-modal feature extraction method adopted in existing algorithms. Whether single-modality or multi-modality data are input, the method surpasses the existing method, as shown in Table 1.
TABLE 1. Quality of the self-supervised features of this method compared with the existing method

Method                              Accuracy (%)
AVEL (audio input only)             59.5
AVEL (picture input only)           55.3
AVEL (audio-visual input)           71.3
This method (audio input only)      63.5
This method (picture input only)    61.0
This method (audio-visual input)    75.6
For the information fusion part, starting from the self-supervised features of this method and further fusing cross-modal information through the audio-visual fusion module, the result is compared with existing algorithms; as shown in Table 2, the accuracy of this method is the highest.
TABLE 2. Comparison of this method, after fusing the self-supervised features, with existing algorithms

Model         Accuracy (%)
AVSDN         72.6
DMRN          73.1
CMAN          73.3
DAM           74.5
AVRB          74.8
AVIN          75.2
MPN           75.2
JCAN          76.2
PSP           76.6
AVT           76.8
This method   77.2
During visual-auditory feature extraction, the invention realizes a preliminary exchange of information between the modalities and reduces the heterogeneity gap between visual and auditory information; during audio-visual information fusion, the dependence on additional manual supervision of the background is reduced, and cross-modal information is further fused, on the basis of the extracted features, by aggregating visual and auditory features that carry similar semantic information. Through these technical means, the recognition accuracy of audio-visual events is improved.

Claims (7)

1. An audio-visual event localization method fusing self-supervised multi-modal features, characterized by comprising the following steps:
acquiring target video data, and preprocessing the target video data to obtain an image signal and a sound signal; inputting the image signal and the sound signal into an audio-visual event positioning model for identification and positioning to obtain the event category of each moment in the target video data;
the audiovisual event positioning model comprises a visual-auditory feature extraction module, an audiovisual fusion module and a classification module which are sequentially connected; the visual-auditory feature extraction module and the audio-visual fusion module are mutually independent; the visual-auditory feature extraction module is used for respectively extracting space-time features of the image signal and the sound signal by using CNN and Bi-LSTM to obtain visual features and auditory features; the audio-visual fusion module calculates the similarity between asynchronous visual features and auditory features based on cosine distance, and corrects the similarity of feature pairs according to the law of correlation attenuation in time and then fuses the features; and the classification module classifies the video data based on the fused visual features and auditory features to obtain the event category of each moment in the target video data.
2. The audio-visual event localization method fusing self-supervised multi-modal features according to claim 1, wherein the preprocessing of the target video data specifically comprises:
dividing the acquired target video data into a plurality of equal-length segments, wherein each segment comprises synchronous image data and sound data;
randomly extracting one frame from each segment of image data and applying random cropping and Gaussian blur to it to obtain an image frame signal; converting each segment of sound data into a log-mel spectrum to obtain a sound spectrum signal;
all the image frame signals and the sound spectrum signals are arranged according to the front-back sequence of time to obtain image signals and sound signals.
3. The audio-visual event localization method fusing self-supervised multi-modal features according to claim 1, wherein the visual-auditory feature extraction module comprises a visual extraction part, an auditory extraction part, a visual projection layer, an auditory projection layer and a cross-correlation matrix unit; the visual extraction part and the auditory extraction part have the same structure, each comprising a CNN and a Bi-LSTM connected in sequence; the input of the visual extraction part is the image signal, from which the visual features are extracted, and the input of the auditory extraction part is the sound signal, from which the auditory features are extracted; the visual projection layer maps the visual features to a higher-dimensional semantic space to obtain high-dimensional visual features; the auditory projection layer maps the auditory features to a higher-dimensional semantic space to obtain high-dimensional auditory features; and the cross-correlation matrix unit computes the cross-correlation matrix between the high-dimensional visual features and the high-dimensional auditory features along the time dimension.
4. The audio-visual event localization method fusing self-supervised multi-modal features according to claim 3, wherein the loss function of the visual-auditory feature extraction module is:

L_{AVBT} = \sum_i (1 - C_{ii})^2 + \lambda \sum_i \sum_{j \neq i} C_{ij}^2

where C_{ij} is the element in the i-th row and j-th column of the cross-correlation matrix, and λ is a hyper-parameter that balances the importance of the diagonal elements against the off-diagonal elements.
5. The audio-visual event localization method fusing self-supervised multi-modal features according to claim 1, wherein the audio-visual fusion module comprises a visual projection matrix part, an auditory projection matrix part, a cross-modal affinity matrix part, a visual update part, an auditory update part and an audio-visual fusion part; the visual projection matrix part performs a linear projection with learnable parameters on the visual features, and the auditory projection matrix part performs a linear projection with learnable parameters on the auditory features; the cross-modal affinity matrix part computes the affinity matrices of the projected visual and auditory features along the time dimension and corrects the affinity matrices with a weighting matrix; the visual update part updates the visual features based on the corrected affinity matrix, and the auditory update part updates the auditory features based on the corrected affinity matrix; and the audio-visual fusion part fuses the updated visual features and auditory features.
6. The audio-visual event localization method fusing self-supervised multi-modal features according to claim 5, wherein the cross-modal affinity matrix part obtains the affinity matrices of the projected visual and auditory features through

M_{av} = \frac{(W_a^1 f_a)(W_v^1 f_v)^T}{\sqrt{d}}, \quad M_{va} = M_{av}^T

and weights the affinity matrices element by element through

M'_{av} = \mathrm{softmax}(W_{av} \odot \mathrm{relu}(M_{av})), \quad M'_{va} = \mathrm{softmax}(W_{va} \odot \mathrm{relu}(M_{va}))

where M_{av} and M_{va} are the affinity matrices, W_a^1 is an auditory projection matrix of learnable parameters, W_v^1 is a visual projection matrix of learnable parameters, d is the dimension of the projected feature vectors, f_a is the auditory feature, f_v is the visual feature, M'_{av} and M'_{va} are the corrected affinity matrices, W_{av} and W_{va} are the weighting matrices corresponding to M_{av} and M_{va} respectively, ⊙ denotes the Hadamard product, softmax(·) denotes the softmax function, and relu(·) denotes the relu function.
7. The audio-visual event localization method fusing self-supervised multi-modal features according to claim 6, wherein the audio-visual fusion part completes the fusion of information between the modalities through

f'_a = W_a^3 (M'_{av} (W_a^2 f_a) + f_v)
f'_v = W_v^3 (M'_{va} (W_v^2 f_v) + f_a)

where W_a^2 and W_a^3 are auditory projection matrices of learnable parameters, and W_v^2 and W_v^3 are visual projection matrices of learnable parameters.
CN202211032147.XA 2022-08-26 2022-08-26 Audio-visual event positioning method fusing self-supervised multi-modal features Pending CN115393968A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211032147.XA CN115393968A (en) Audio-visual event positioning method fusing self-supervised multi-modal features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211032147.XA CN115393968A (en) Audio-visual event positioning method fusing self-supervised multi-modal features

Publications (1)

Publication Number Publication Date
CN115393968A true CN115393968A (en) 2022-11-25

Family

ID=84121813

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211032147.XA Pending CN115393968A (en) 2022-08-26 2022-08-26 Audio-visual event positioning method fusing self-supervision multi-mode features

Country Status (1)

Country Link
CN (1) CN115393968A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116310975A (en) * 2023-03-14 2023-06-23 北京邮电大学 Audiovisual event positioning method based on consistent fragment selection
CN117037046A (en) * 2023-10-08 2023-11-10 之江实验室 Audio-visual event detection method and device, storage medium and electronic equipment
CN117037046B (en) * 2023-10-08 2024-01-09 之江实验室 Audio-visual event detection method and device, storage medium and electronic equipment
CN117611255A (en) * 2023-11-07 2024-02-27 北京创信众科技有限公司 Advertisement operation method and system based on big data

Similar Documents

Publication Publication Date Title
Niu et al. Multimodal spatiotemporal representation for automatic depression level detection
CN115393968A (en) Audio-visual event positioning method fusing self-supervised multi-modal features
CN112818861B (en) Emotion classification method and system based on multi-mode context semantic features
CN110033756B (en) Language identification method and device, electronic equipment and storage medium
CN110516536A (en) A kind of Weakly supervised video behavior detection method for activating figure complementary based on timing classification
CN111079594B (en) Video action classification and identification method based on double-flow cooperative network
CN112487949B (en) Learner behavior recognition method based on multi-mode data fusion
CN116171473A (en) Bimodal relationship network for audio-visual event localization
CN114519809A (en) Audio-visual video analysis device and method based on multi-scale semantic network
CN113723166A (en) Content identification method and device, computer equipment and storage medium
CN110765854A (en) Video motion recognition method
CN112541529A (en) Expression and posture fusion bimodal teaching evaluation method, device and storage medium
CN109190521B (en) Construction method and application of face recognition model based on knowledge purification
CN115862684A (en) Audio-based depression state auxiliary detection method for dual-mode fusion type neural network
CN114220458B (en) Voice recognition method and device based on array hydrophone
CN111986699A (en) Sound event detection method based on full convolution network
CN111932056A (en) Customer service quality scoring method and device, computer equipment and storage medium
CN110874576A (en) Pedestrian re-identification method based on canonical correlation analysis fusion features
CN115147641A (en) Video classification method based on knowledge distillation and multi-mode fusion
Cai et al. TDCA-Net: Time-Domain Channel Attention Network for Depression Detection.
CN115471771A (en) Video time sequence action positioning method based on semantic level time sequence correlation modeling
CN116049557A (en) Educational resource recommendation method based on multi-mode pre-training model
CN117763446B (en) Multi-mode emotion recognition method and device
CN116701996A (en) Multi-modal emotion analysis method, system, equipment and medium based on multiple loss functions
Lu et al. Temporal Attentive Pooling for Acoustic Event Detection.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination