CN115393968A - Audio-visual event positioning method fusing self-supervised multi-modal features - Google Patents

Audio-visual event positioning method fusing self-supervised multi-modal features

Info

Publication number
CN115393968A
Authority
CN
China
Prior art keywords
visual
auditory
features
matrix
audio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211032147.XA
Other languages
Chinese (zh)
Inventor
冉粤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Institute of Microsystem and Information Technology of CAS
Original Assignee
Shanghai Institute of Microsystem and Information Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Institute of Microsystem and Information Technology of CAS filed Critical Shanghai Institute of Microsystem and Information Technology of CAS
Priority to CN202211032147.XA priority Critical patent/CN115393968A/en
Publication of CN115393968A publication Critical patent/CN115393968A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/70 Multimodal biometrics, e.g. combining information from different biometric modalities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to an audio-visual event positioning method that fuses self-supervised multi-modal features. The method comprises: acquiring target video data and preprocessing it to obtain an image signal and a sound signal; and inputting the image signal and the sound signal into an audio-visual event positioning model for recognition and localization, obtaining the event category at each moment in the target video data. The audio-visual event positioning model comprises a visual-auditory feature extraction module, an audio-visual fusion module and a classification module connected in sequence, wherein the visual-auditory feature extraction module and the audio-visual fusion module are independent of each other. The invention improves the recognition accuracy of audio-visual events.

Description

Audio-visual event positioning method fusing self-supervised multi-modal features
Technical Field
The invention relates to the technical field of audio-visual event positioning, and in particular to an audio-visual event positioning method fusing self-supervised multi-modal features.
Background
Humans perceive the surrounding environment through a variety of senses (e.g., vision, hearing, touch and smell). Although machine learning has made dramatic advances in single-modality tasks such as image classification (vision), speech recognition (hearing) and natural language processing (text), many tasks that require the coordinated use of data from multiple modalities are being proposed so that machines can better mimic the way humans perceive the real world. These multi-modal tasks are closer to practical application scenarios, and research on them is increasing.
Audio-visual event localization based on both visual and auditory signals is an important application in the field of multi-modal video understanding and analysis. For a given video, the algorithm needs to determine when an audio-visual event of interest occurs and what type of event it is. Because an audio-visual event has both visual and auditory attributes, the information in the image data and the audio data of the video can be fully exploited to localize the event more efficiently and accurately. On the one hand, although multi-modal data provide more useful information about the video content than a single modality, the natural difference in how visual and auditory signals are composed makes it difficult to fuse information from different modalities, and the multi-modal input also introduces more noise. On the other hand, when no audio-visual event of interest occurs, the algorithm must still process the incoming signals and correctly judge them as background. Such background segments usually lack the distinctive visual and auditory characteristics of a specific event, which makes them harder to identify. To address these problems, existing methods usually integrate visual and auditory feature extraction and the fusion of the two modalities into a single end-to-end model, and mainly focus on designing specific attention modules to model the cross-modal relationship during fusion. However, existing algorithms have the following main limitations:
(1) Information exchange between different modalities during the feature extraction stage is often neglected, which prevents sufficient information fusion later.
(2) During information fusion, only the modeling of cross-modal relationships under certain special conditions is considered, so the algorithms lack generality overall.
(3) Additional modules or manual supervision are relied upon to identify the background when no audio-visual event of interest occurs; these extra modules and manual supervision greatly reduce the robustness of the algorithms in practical applications.
Disclosure of Invention
The technical problem the invention aims to solve is to provide an audio-visual event positioning method fusing self-supervised multi-modal features that can improve the recognition accuracy of audio-visual events.
The technical scheme adopted by the invention to solve the above technical problem is as follows: an audio-visual event positioning method fusing self-supervised multi-modal features, comprising the following steps:
acquiring target video data, and preprocessing the target video data to obtain an image signal and a sound signal;
inputting the image signal and the sound signal into an audio-visual event positioning model for identification and positioning to obtain the event category of each moment in the target video data;
the audiovisual event positioning model comprises a visual-auditory feature extraction module, an audiovisual fusion module and a classification module which are sequentially connected; the visual-auditory feature extraction module and the audio-visual fusion module are mutually independent; the visual-auditory feature extraction module is used for respectively extracting space-time features of the image signal and the sound signal by using CNN and Bi-LSTM to obtain visual features and auditory features; the audio-visual fusion module calculates the similarity between asynchronous visual features and auditory features based on cosine distance, and corrects the similarity of feature pairs according to the law of correlation decay in time and then fuses the features; and the classification module classifies based on the fused visual features and auditory features to obtain the event category of each moment in the target video data.
The preprocessing of the target video data specifically comprises:
dividing the acquired target video data into a plurality of equal-length segments, wherein each segment comprises synchronous image data and sound data;
randomly extracting one frame from each segment of image data and applying random cropping and Gaussian blur to it to obtain an image frame signal; converting each segment of sound data into a log-mel spectrum to obtain a sound spectrum signal;
arranging all the image frame signals and sound spectrum signals in chronological order to obtain the image signal and the sound signal.
The visual-auditory feature extraction module comprises a visual extraction part, an auditory extraction part, a visual projection layer, an auditory projection layer and a cross-correlation matrix unit. The visual extraction part and the auditory extraction part have the same structure, each comprising a CNN and a Bi-LSTM connected in sequence. The input of the visual extraction part is the image signal, from which the visual features are extracted; the input of the auditory extraction part is the sound signal, from which the auditory features are extracted. The visual projection layer maps the visual features to a higher-dimensional semantic space to obtain high-dimensional visual features, and the auditory projection layer maps the auditory features to a higher-dimensional semantic space to obtain high-dimensional auditory features. The cross-correlation matrix unit computes the cross-correlation matrix between the high-dimensional visual features and the high-dimensional auditory features along the time dimension.
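The following is a minimal sketch of such a visual-auditory feature extraction module. The ResNet-50 visual backbone and the 2048/4096/4096 projection layers follow the embodiment described later; the small audio CNN merely stands in for a VGGish-style backbone (which is not bundled with torchvision), and the Bi-LSTM hidden size is an assumption.

import torch
import torch.nn as nn
import torchvision

class AVFeatureExtractor(nn.Module):
    """Sketch of the visual-auditory feature extraction module (some dimensions assumed)."""
    def __init__(self, feat_dim=256):
        super().__init__()
        resnet = torchvision.models.resnet50(weights="IMAGENET1K_V1")
        self.v_cnn = nn.Sequential(*list(resnet.children())[:-1])   # visual CNN -> (B*T, 2048, 1, 1)
        self.a_cnn = nn.Sequential(                                  # stand-in for a VGGish-style audio CNN
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 128))
        self.v_lstm = nn.LSTM(2048, feat_dim, batch_first=True, bidirectional=True)
        self.a_lstm = nn.LSTM(128, feat_dim, batch_first=True, bidirectional=True)
        def proj(din):  # 3 linear projection layers, dims 2048 / 4096 / 4096 as in the embodiment
            return nn.Sequential(nn.Linear(din, 2048), nn.ReLU(),
                                 nn.Linear(2048, 4096), nn.ReLU(),
                                 nn.Linear(4096, 4096))
        self.v_proj, self.a_proj = proj(2 * feat_dim), proj(2 * feat_dim)

    def forward(self, images, spectra):
        # images: (B, T, 3, H, W); spectra: (B, T, 1, F, M)
        B, T = images.shape[:2]
        v = self.v_cnn(images.flatten(0, 1)).flatten(1).view(B, T, -1)
        a = self.a_cnn(spectra.flatten(0, 1)).view(B, T, -1)
        f_v, _ = self.v_lstm(v)                          # visual features f_v: (B, T, 2*feat_dim)
        f_a, _ = self.a_lstm(a)                          # auditory features f_a
        z_v, z_a = self.v_proj(f_v), self.a_proj(f_a)    # high-dimensional features for the self-supervised loss
        return f_v, f_a, z_v, z_a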
The loss function of the visual-auditory feature extraction module is:

L_{AVBT} = \sum_i (1 - C_{ii})^2 + \lambda \sum_i \sum_{j \neq i} C_{ij}^2

where C_{ij} is the element in the i-th row and j-th column of the cross-correlation matrix, and λ is a hyper-parameter that balances the importance of the diagonal elements against the off-diagonal elements.
The audio-visual fusion module comprises a visual projection matrix part, an auditory projection matrix part, a cross-modal affinity matrix part, a visual update part, an auditory update part and an audio-visual fusion part. The visual projection matrix part performs a linear projection with learnable parameters on the visual features, and the auditory projection matrix part performs a linear projection with learnable parameters on the auditory features; the cross-modal affinity matrix part computes the affinity matrices of the projected visual and auditory features along the time dimension and corrects the affinity matrices with a weighting matrix; the visual update part updates the visual features based on the corrected affinity matrix, and the auditory update part updates the auditory features based on the corrected affinity matrix; the audio-visual fusion part fuses the updated visual features and auditory features.
The cross-modal affinity matrix part obtains the affinity matrices of the projected visual and auditory features through

M_{av} = \frac{(W_a^1 f_a)(W_v^1 f_v)^T}{\sqrt{d}}, \quad M_{va} = M_{av}^T

and weights the affinity matrices element by element through

M'_{av} = \mathrm{softmax}(W_{av} \odot \mathrm{relu}(M_{av})), \quad M'_{va} = \mathrm{softmax}(W_{va} \odot \mathrm{relu}(M_{va}))

where M_{av} and M_{va} are the affinity matrices, W_a^1 is an auditory projection matrix of learnable parameters, W_v^1 is a visual projection matrix of learnable parameters, d is the dimension of the projected feature vectors, f_a is the auditory feature, f_v is the visual feature, M'_{av} and M'_{va} are the corrected affinity matrices, W_{av} and W_{va} are the weighting matrices corresponding to M_{av} and M_{va} respectively, ⊙ denotes the Hadamard product, softmax(·) denotes the softmax function, and relu(·) denotes the relu function.
The audio-visual fusion part completes the fusion of information between the modalities through

f'_a = W_a^3 (M'_{av} (W_a^2 f_a) + f_v)
f'_v = W_v^3 (M'_{va} (W_v^2 f_v) + f_a)

where W_a^2 and W_a^3 are auditory projection matrices of learnable parameters, and W_v^2 and W_v^3 are visual projection matrices of learnable parameters.
Advantageous effects
Owing to the adoption of the above technical scheme, compared with the prior art, the invention has the following advantages and positive effects: during visual-auditory feature extraction, the invention realizes a preliminary exchange of information between the modalities and reduces the heterogeneity gap between visual and auditory information; during audio-visual information fusion, the dependence on additional manual supervision of the background is reduced, and cross-modal information is further fused, on the basis of the extracted features, by aggregating visual and auditory features that carry similar semantic information. Through these technical means, the recognition accuracy of audio-visual events is improved.
Drawings
FIG. 1 is a flow chart of an embodiment of the present invention;
FIG. 2 is a schematic diagram of an audiovisual event localization model in an embodiment of the present invention;
FIG. 3 is a schematic diagram of a visual-auditory feature extraction module in an embodiment of the invention;
FIG. 4 is a schematic diagram of an audiovisual fusion module in an embodiment of the invention.
Detailed Description
The invention will be further illustrated with reference to the following specific examples. It should be understood that these examples are for illustrative purposes only and are not intended to limit the scope of the present invention. Further, it should be understood that various changes or modifications of the present invention may be made by those skilled in the art after reading the teaching of the present invention, and such equivalents may fall within the scope of the present invention as defined in the appended claims.
The embodiment of the invention relates to an audio-visual event positioning method fusing self-supervised multi-modal features which, as shown in FIG. 1, comprises the following steps: acquiring target video data and preprocessing it to obtain an image signal and a sound signal; and inputting the image signal and the sound signal into an audio-visual event positioning model for recognition and localization to obtain the event category at each moment in the target video data.
The audio-visual event positioning model comprises a visual-auditory feature extraction module, an audio-visual fusion module and a classification module connected in sequence; the visual-auditory feature extraction module and the audio-visual fusion module are independent of each other. The visual-auditory feature extraction module uses a CNN and a Bi-LSTM to extract spatio-temporal features from the image signal and the sound signal respectively, obtaining visual features and auditory features. The audio-visual fusion module calculates the similarity between asynchronous visual and auditory features based on the cosine distance, corrects the similarity of each feature pair according to the law that correlation decays over time, and then fuses the features. The classification module classifies the fused visual and auditory features to obtain the event category at each moment in the target video data.
During visual-auditory feature extraction, this embodiment realizes a preliminary exchange of information between the modalities and reduces the heterogeneity gap between visual and auditory information; during audio-visual information fusion, the dependence on additional manual supervision of the background is reduced, and cross-modal information is further fused, on the basis of the extracted features, by aggregating visual and auditory features with similar semantics. As shown in FIG. 2, the audio-visual event localization model in this embodiment is no longer end-to-end as a whole but is divided into two stages, feature extraction and information fusion, which can promote each other and greatly improve the effectiveness of event localization.
In this embodiment, the visual-auditory feature extraction module comprises a visual extraction part, an auditory extraction part, a visual projection layer, an auditory projection layer and a cross-correlation matrix unit. The visual extraction part and the auditory extraction part have the same structure, each comprising a CNN and a Bi-LSTM connected in sequence. The input of the visual extraction part is the image signal, from which the visual features are extracted; the input of the auditory extraction part is the sound signal, from which the auditory features are extracted. The visual projection layer maps the visual features to a higher-dimensional semantic space to obtain high-dimensional visual features, and the auditory projection layer maps the auditory features to a higher-dimensional semantic space to obtain high-dimensional auditory features. The cross-correlation matrix unit computes the cross-correlation matrix between the high-dimensional visual features and the high-dimensional auditory features along the time dimension.
In the first stage, the feature extraction module is trained in a self-supervised manner, as shown in FIG. 3. First, an input video S is divided into T equal-length segments, and each segment S_t contains synchronized image data V_t and sound data A_t. In the preprocessing stage of each training cycle, one frame is randomly extracted from each image segment V_t and subjected to random cropping and Gaussian blur to obtain a frame signal frame_t; each sound segment A_t is converted into a log-mel spectrum to obtain a sound spectrum signal spec_t. All frame signals frame_t and sound spectrum signals spec_t are arranged in chronological order to obtain the preprocessed results, i.e. the image signal frame and the sound signal spec, which are then fed into the feature extraction module. In the feature extraction module, a CNN and a Bi-LSTM are used in sequence to extract the spatio-temporal features of the visual and auditory signals, yielding the visual features f_v and the auditory features f_a. The visual features f_v and auditory features f_a are then mapped to a higher-dimensional semantic space through two separate projection layers, yielding the high-dimensional visual features z_v and the high-dimensional auditory features z_a.
Since synchronized visual and auditory signals are produced simultaneously by the same event, they can be regarded as two different expressions of that event and are naturally highly correlated. To enable the feature extraction module to learn high-quality joint audio-visual features, this embodiment proposes a new self-supervised loss function. To compute it, the cross-correlation matrix C between the high-dimensional visual features z_v and the high-dimensional auditory features z_a is first computed along the time dimension; the element C_{ij} in the i-th row and j-th column of C is defined as

C_{ij} = \frac{\sum_t z_a^{t,i} \, z_v^{t,j}}{\sqrt{\sum_t (z_a^{t,i})^2} \; \sqrt{\sum_t (z_v^{t,j})^2}}

where t indexes the video-audio segments, and i and j index the element positions in the high-dimensional auditory feature z_a and the high-dimensional visual feature z_v, respectively. The difference between the cross-correlation matrix and the identity matrix (of the same dimension) is then reduced, with the loss function L_{AVBT} given by

L_{AVBT} = \sum_i (1 - C_{ii})^2 + \lambda \sum_i \sum_{j \neq i} C_{ij}^2

where λ is a hyper-parameter that balances the importance of the diagonal elements against the off-diagonal elements. By minimizing L_{AVBT} during training, the feature extraction module learns to extract high-quality features: the heterogeneity gap between the different modalities is reduced by raising the semantic similarity of the audio-visual features, and information is preliminarily exchanged between the synchronized visual and auditory signals.
It should be noted that, after the first-stage training of the feature extraction module is finished, the parameters of the feature extraction module are frozen and used to extract fixed visual features f_v and auditory features f_a for the videos in the dataset, which are used for the subsequent cross-modal information fusion.
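In PyTorch, this freezing step can be done, for example, as follows (extractor refers to the feature extraction module sketched above; the variable names are illustrative):

# Freeze the trained extractor before the second stage.
extractor.eval()
for p in extractor.parameters():
    p.requires_grad_(False)

with torch.no_grad():
    f_v, f_a, _, _ = extractor(images, spectra)   # fixed features for the fusion stage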
In this embodiment, the audio-visual fusion module comprises a visual projection matrix part, an auditory projection matrix part, a cross-modal affinity matrix part, a visual update part, an auditory update part and an audio-visual fusion part. The visual projection matrix part performs a linear projection with learnable parameters on the visual features, and the auditory projection matrix part performs a linear projection with learnable parameters on the auditory features; the cross-modal affinity matrix part computes the affinity matrices of the projected visual and auditory features along the time dimension and corrects the affinity matrices with a weighting matrix; the visual update part updates the visual features based on the corrected affinity matrix, and the auditory update part updates the auditory features based on the corrected affinity matrix; the audio-visual fusion part fuses the updated visual and auditory features.
In the second stage, the audio-visual fusion module is trained on the basis of the visual and auditory features obtained in the first stage, under the supervision of the training-sample labels, to further fuse cross-modal information, as shown in FIG. 4. The main purpose of this stage is to compute the similarity between asynchronous visual and auditory features based on the cosine distance and, following the general law that correlation decays over time (i.e., the further apart visual and auditory signals are in time, the weaker their correlation), to correct the similarity of each feature pair and then aggregate features with high similarity. Specifically, the visual features f_v and auditory features f_a extracted by the feature extraction module are taken as input, and their affinity matrices along the time dimension are computed after linear projection as follows:
M_{av} = \frac{(W_a^1 f_a)(W_v^1 f_v)^T}{\sqrt{d}}, \quad M_{va} = M_{av}^T

where W_a^1 is an auditory projection matrix of learnable parameters, W_v^1 is a visual projection matrix of learnable parameters, and d is the dimension of the projected feature vectors. M_{av} and M_{va} are then weighted element by element with a weight based on the time difference between the audio-visual features, the formula being:
M'_{av} = \mathrm{softmax}(W_{av} \odot \mathrm{relu}(M_{av})), \quad M'_{va} = \mathrm{softmax}(W_{va} \odot \mathrm{relu}(M_{va}))

where relu(·) is the relu function, used to filter out negatively correlated audio-visual feature pairs; softmax(·) is the softmax function, used to normalize each row of the matrix; ⊙ denotes the Hadamard product, which applies the weight to each pair of audio-visual features; and W_{av} and W_{va} are the weighting matrices corresponding to the affinity matrices M_{av} and M_{va}, respectively. Taking W_{av} as an example, its element at position (i', j'), where i' indexes the auditory feature at the i'-th moment and j' indexes the visual feature at the j'-th moment (so that the absolute difference |i' - j'| is the time difference between the two features, and (i', j') is also the position of their correlation in M_{av}), is computed with the exponential function exp(·) so that it decreases as the time difference grows; θ is a settable hyper-parameter that balances the influence of the time difference between feature pairs against their similarity in the weighting operation and directly affects the final performance of the audio-visual fusion module. The corrected affinity matrices are then used to update the features of the visual and auditory modalities respectively, and the features of the different modalities are added to complete the fusion of information between modalities, computed as follows:
f'_a = W_a^3 (M'_{av} (W_a^2 f_a) + f_v)
f'_v = W_v^3 (M'_{va} (W_v^2 f_v) + f_a)

where W_a^2 and W_a^3 are auditory projection matrices of learnable parameters, and W_v^2 and W_v^3 are visual projection matrices of learnable parameters.
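Putting the reconstructed formulas together, a sketch of the audio-visual fusion module could look as follows. The exponential form of the temporal weighting matrix W_{av} is an assumption (the original discloses it only as an equation image), θ = 0.03 follows the embodiment, and all layer names are illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AVFusion(nn.Module):
    """Sketch of the audio-visual fusion module (temporal weighting form assumed)."""
    def __init__(self, dim, proj_dim, theta=0.03):
        super().__init__()
        self.Wa1, self.Wv1 = nn.Linear(dim, proj_dim, bias=False), nn.Linear(dim, proj_dim, bias=False)
        self.Wa2, self.Wv2 = nn.Linear(dim, dim, bias=False), nn.Linear(dim, dim, bias=False)
        self.Wa3, self.Wv3 = nn.Linear(dim, dim, bias=False), nn.Linear(dim, dim, bias=False)
        self.theta, self.d = theta, proj_dim

    def time_weight(self, T, device):
        # Assumed weighting: correlation decays exponentially with the time gap |i' - j'|.
        idx = torch.arange(T, device=device, dtype=torch.float32)
        return torch.exp(-self.theta * (idx[:, None] - idx[None, :]).abs())

    def forward(self, f_a, f_v):
        # f_a, f_v: (B, T, dim) auditory / visual features from the frozen extractor
        B, T, _ = f_a.shape
        M_av = (self.Wa1(f_a) @ self.Wv1(f_v).transpose(1, 2)) / self.d ** 0.5   # affinity matrix
        M_va = M_av.transpose(1, 2)
        W = self.time_weight(T, f_a.device)
        M_av = F.softmax(W * F.relu(M_av), dim=-1)       # filter negative pairs, weight, row-normalize
        M_va = F.softmax(W * F.relu(M_va), dim=-1)
        f_a_new = self.Wa3(M_av @ self.Wa2(f_a) + f_v)   # f'_a = W_a^3(M'_av W_a^2 f_a + f_v)
        f_v_new = self.Wv3(M_va @ self.Wv2(f_v) + f_a)   # f'_v = W_v^3(M'_va W_v^2 f_v + f_a)
        return f_a_new, f_v_new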
After the cross-modal information fusion is completed, the fused visual features and fused auditory features are added together and fed into the classification module. In this embodiment, the classification module is a 3-layer perceptron, and the final classification result is computed as

p = \mathrm{MLP}(\mathrm{layernorm}(f'_a) + \mathrm{layernorm}(f'_v))

where layernorm(·) is the layer normalization function. For a video divided into T segments, the output prediction p has dimension R^{T×C}, i.e. it indicates for each segment which type of event it belongs to, or whether it is background. When training the fusion module and the classification layer, the loss function is the cross-entropy between the output prediction p and the ground-truth video label gt.
To verify the effectiveness of this embodiment, a model was built on the PyTorch platform and experiments were conducted on the AVE dataset.
Experimental data: the AVE dataset contains 4143 videos covering 28 categories of audio-visual events, including musical instrument performance, human behavior, vehicle activity and animal activity. The dataset is divided by default into three subsets: the training, validation and test sets contain 3339, 402 and 402 videos, respectively. Each video is 10 seconds long; in the experiments, the pictures and sound of each video are divided into 10 consecutive equal-length 1 s segments for training the model. In the training phase (the first and second stages in FIG. 2), only the training set is used to train the model parameters, while the validation set is used to monitor overfitting when training the fusion module; in the verification phase, the model makes inferences on the test set.
Model evaluation method: when localizing audio-visual events on the AVE dataset, the prediction accuracy over all segments of all test videos is used as the metric, following the evaluation of previous algorithms. Because the model in this embodiment is no longer end-to-end, when the feature extraction module is trained in the self-supervised manner in the first stage, the quality of the extracted features is measured by classifying them directly with the classification layer without passing through the second-stage audio-visual fusion module (i.e., the features are used for the audio-visual event localization task without further cross-modal information fusion), and the prediction accuracy is used to measure the quality of the self-supervised features. The audio-visual fusion module in the second stage takes the self-supervised features learned in the first stage as input and is supervised directly with the ground-truth labels of the dataset during training, so the classification predictions obtained after fusing the self-supervised features are more accurate, and the comparison with existing algorithms is also fairer.
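The segment-level accuracy used as the metric can be computed, for example, as follows; logits and labels are assumed to be torch tensors of shape (N, T, C) and (N, T).

def segment_accuracy(logits, labels):
    """Fraction of 1 s segments whose predicted event/background class matches the label."""
    pred = logits.argmax(-1)                      # (N, T) predicted class per segment
    return (pred == labels).float().mean().item()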
Model details: the preprocessing part extracts 160 RGB frames with a resolution of 256x256 at a frame rate of 16 for each 10 s video; a fixed-length 10 s sound signal is extracted at a sampling rate of 16 kHz and divided into 10 segments of 1 s length, a short-time Fourier transform is applied to each segment, and a log-mel spectrum with dimension 96x64 is obtained through a mel filter bank. In the feature extraction module, the CNN that processes the visual signal is a ResNet-50 pre-trained on ImageNet; the CNN that processes the log-mel spectrum is a VGGish pre-trained on AudioSet; the projection layers are 3 linear projection layers whose dimensions are 2048, 4096 and 4096 from input to output. In the audio-visual fusion module, θ in the weighting matrix is set to 0.03.
Model training: the parameter λ of the loss function L_{AVBT} used when training the feature extraction module in the self-supervised manner is set to 5e-3; the module is trained with the LARS optimizer for 100 epochs (the first 10 epochs are used for learning-rate warm-up, and the learning rate is cosine-annealed over the remaining 90), with a batch size of 128 and a base learning rate of 2e-4. When training the audio-visual fusion module, the initial learning rate is 1e-4 and is multiplied by 0.98 every 10 epochs, the batch size is 128, and the Adam optimizer is used.
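A sketch of the corresponding optimizer and schedule setup is given below. Since LARS is not part of core PyTorch, plain SGD stands in for stage one (a third-party LARS implementation can be substituted); extractor, fusion and classifier refer to the illustrative modules sketched above.

import torch
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR, StepLR

# Stage one (self-supervised extractor): 100 epochs, 10 warm-up + 90 cosine-annealed, base lr 2e-4.
opt1 = torch.optim.SGD(extractor.parameters(), lr=2e-4, momentum=0.9)
sched1 = SequentialLR(
    opt1,
    schedulers=[LinearLR(opt1, start_factor=0.1, total_iters=10),
                CosineAnnealingLR(opt1, T_max=90)],
    milestones=[10])

# Stage two (fusion + classifier): Adam, initial lr 1e-4, multiplied by 0.98 every 10 epochs.
params2 = list(fusion.parameters()) + list(classifier.parameters())
opt2 = torch.optim.Adam(params2, lr=1e-4)
sched2 = StepLR(opt2, step_size=10, gamma=0.98)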
Experimental results: for the self-supervised feature extraction part, the method is mainly compared with AVEL, the most popular multi-modal feature extraction method adopted in existing algorithms. Whether single-modality or multi-modality data are input, the method surpasses the existing method, as shown in Table 1.
TABLE 1. Quality of the self-supervised features of this method compared with the existing method

Method                              Accuracy (%)
AVEL (audio input only)             59.5
AVEL (picture input only)           55.3
AVEL (audio-visual input)           71.3
This method (audio input only)      63.5
This method (picture input only)    61.0
This method (audio-visual input)    75.6
For the information fusion part, starting from the self-supervised features of this method and further fusing cross-modal information through the audio-visual fusion module, the result is compared with existing algorithms; as shown in Table 2, the accuracy of this method is the highest.
TABLE 2. Comparison of this method, after fusing the self-supervised features, with existing algorithms

Model         Accuracy (%)
AVSDN         72.6
DMRN          73.1
CMAN          73.3
DAM           74.5
AVRB          74.8
AVIN          75.2
MPN           75.2
JCAN          76.2
PSP           76.6
AVT           76.8
This method   77.2
During visual-auditory feature extraction, the invention realizes a preliminary exchange of information between the modalities and reduces the heterogeneity gap between visual and auditory information; during audio-visual information fusion, the dependence on additional manual supervision of the background is reduced, and cross-modal information is further fused, on the basis of the extracted features, by aggregating visual and auditory features that carry similar semantic information. Through these technical means, the recognition accuracy of audio-visual events is improved.

Claims (7)

1. An audio-visual event localization method fusing self-supervised multi-modal features, characterized by comprising the following steps:
acquiring target video data, and preprocessing the target video data to obtain an image signal and a sound signal; inputting the image signal and the sound signal into an audio-visual event positioning model for identification and positioning to obtain the event category of each moment in the target video data;
the audiovisual event positioning model comprises a visual-auditory feature extraction module, an audiovisual fusion module and a classification module which are sequentially connected; the visual-auditory feature extraction module and the audio-visual fusion module are mutually independent; the visual-auditory feature extraction module is used for respectively extracting space-time features of the image signal and the sound signal by using CNN and Bi-LSTM to obtain visual features and auditory features; the audio-visual fusion module calculates the similarity between asynchronous visual features and auditory features based on cosine distance, and corrects the similarity of feature pairs according to the law of correlation attenuation in time and then fuses the features; and the classification module classifies the video data based on the fused visual features and auditory features to obtain the event category of each moment in the target video data.
2. The audio-visual event localization method fusing self-supervised multi-modal features according to claim 1, wherein the preprocessing of the target video data specifically comprises:
dividing the acquired target video data into a plurality of equal-length segments, wherein each segment comprises synchronous image data and sound data;
randomly extracting one frame from each segment of image data and applying random cropping and Gaussian blur to it to obtain an image frame signal; converting each segment of sound data into a log-mel spectrum to obtain a sound spectrum signal;
all the image frame signals and the sound spectrum signals are arranged according to the front-back sequence of time to obtain image signals and sound signals.
3. The audio-visual event localization method fusing self-supervised multi-modal features according to claim 1, wherein the visual-auditory feature extraction module comprises a visual extraction part, an auditory extraction part, a visual projection layer, an auditory projection layer and a cross-correlation matrix unit; the visual extraction part and the auditory extraction part have the same structure, each comprising a CNN and a Bi-LSTM connected in sequence; the input of the visual extraction part is the image signal, from which the visual features are extracted, and the input of the auditory extraction part is the sound signal, from which the auditory features are extracted; the visual projection layer maps the visual features to a higher-dimensional semantic space to obtain high-dimensional visual features; the auditory projection layer maps the auditory features to a higher-dimensional semantic space to obtain high-dimensional auditory features; and the cross-correlation matrix unit computes the cross-correlation matrix between the high-dimensional visual features and the high-dimensional auditory features along the time dimension.
4. The audio-visual event localization method fusing self-supervised multi-modal features according to claim 3, wherein the loss function of the visual-auditory feature extraction module is:

L_{AVBT} = \sum_i (1 - C_{ii})^2 + \lambda \sum_i \sum_{j \neq i} C_{ij}^2

where C_{ij} is the element in the i-th row and j-th column of the cross-correlation matrix, and λ is a hyper-parameter that balances the importance of the diagonal elements against the off-diagonal elements.
5. The audio-visual event localization method fusing self-supervised multi-modal features according to claim 1, wherein the audio-visual fusion module comprises a visual projection matrix part, an auditory projection matrix part, a cross-modal affinity matrix part, a visual update part, an auditory update part and an audio-visual fusion part; the visual projection matrix part performs a linear projection with learnable parameters on the visual features, and the auditory projection matrix part performs a linear projection with learnable parameters on the auditory features; the cross-modal affinity matrix part computes the affinity matrices of the projected visual and auditory features along the time dimension and corrects the affinity matrices with a weighting matrix; the visual update part updates the visual features based on the corrected affinity matrix, and the auditory update part updates the auditory features based on the corrected affinity matrix; and the audio-visual fusion part fuses the updated visual features and auditory features.
6. The audio-visual event localization method fusing self-supervised multi-modal features according to claim 5, wherein the cross-modal affinity matrix part obtains the affinity matrices of the projected visual and auditory features through

M_{av} = \frac{(W_a^1 f_a)(W_v^1 f_v)^T}{\sqrt{d}}, \quad M_{va} = M_{av}^T

and weights the affinity matrices element by element through

M'_{av} = \mathrm{softmax}(W_{av} \odot \mathrm{relu}(M_{av})), \quad M'_{va} = \mathrm{softmax}(W_{va} \odot \mathrm{relu}(M_{va}))

where M_{av} and M_{va} are the affinity matrices, W_a^1 is an auditory projection matrix of learnable parameters, W_v^1 is a visual projection matrix of learnable parameters, d is the dimension of the projected feature vectors, f_a is the auditory feature, f_v is the visual feature, M'_{av} and M'_{va} are the corrected affinity matrices, W_{av} and W_{va} are the weighting matrices corresponding to M_{av} and M_{va} respectively, ⊙ denotes the Hadamard product, softmax(·) denotes the softmax function, and relu(·) denotes the relu function.
7. The audio-visual event localization method fusing self-supervised multi-modal features according to claim 6, wherein the audio-visual fusion part completes the fusion of information between the modalities through

f'_a = W_a^3 (M'_{av} (W_a^2 f_a) + f_v)
f'_v = W_v^3 (M'_{va} (W_v^2 f_v) + f_a)

where W_a^2 and W_a^3 are auditory projection matrices of learnable parameters, and W_v^2 and W_v^3 are visual projection matrices of learnable parameters.
CN202211032147.XA 2022-08-26 2022-08-26 Audio-visual event positioning method fusing self-supervised multi-modal features Pending CN115393968A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211032147.XA CN115393968A (en) Audio-visual event positioning method fusing self-supervised multi-modal features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211032147.XA CN115393968A (en) Audio-visual event positioning method fusing self-supervised multi-modal features

Publications (1)

Publication Number Publication Date
CN115393968A true CN115393968A (en) 2022-11-25

Family

ID=84121813

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211032147.XA Pending CN115393968A (en) 2022-08-26 2022-08-26 Audio-visual event positioning method fusing self-supervision multi-mode features

Country Status (1)

Country Link
CN (1) CN115393968A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116310975A (en) * 2023-03-14 2023-06-23 北京邮电大学 Audiovisual event positioning method based on consistent fragment selection
CN117037046A (en) * 2023-10-08 2023-11-10 之江实验室 Audio-visual event detection method and device, storage medium and electronic equipment
CN117037046B (en) * 2023-10-08 2024-01-09 之江实验室 Audio-visual event detection method and device, storage medium and electronic equipment
CN117611255A (en) * 2023-11-07 2024-02-27 北京创信众科技有限公司 Advertisement operation method and system based on big data

Similar Documents

Publication Publication Date Title
Niu et al. Multimodal spatiotemporal representation for automatic depression level detection
CN115393968A (en) Audio-visual event positioning method fusing self-supervised multi-modal features
CN112818861B (en) Emotion classification method and system based on multi-mode context semantic features
CN110033756B (en) Language identification method and device, electronic equipment and storage medium
CN110516536A (en) A kind of Weakly supervised video behavior detection method for activating figure complementary based on timing classification
CN111079594B (en) Video action classification and identification method based on double-flow cooperative network
CN112487949B (en) Learner behavior recognition method based on multi-mode data fusion
CN116171473A (en) Bimodal relationship network for audio-visual event localization
CN114519809A (en) Audio-visual video analysis device and method based on multi-scale semantic network
CN113723166A (en) Content identification method and device, computer equipment and storage medium
CN110765854A (en) Video motion recognition method
CN112541529A (en) Expression and posture fusion bimodal teaching evaluation method, device and storage medium
CN109190521B (en) Construction method and application of face recognition model based on knowledge purification
CN115862684A (en) Audio-based depression state auxiliary detection method for dual-mode fusion type neural network
CN114220458B (en) Voice recognition method and device based on array hydrophone
CN111986699A (en) Sound event detection method based on full convolution network
CN111932056A (en) Customer service quality scoring method and device, computer equipment and storage medium
CN110874576A (en) Pedestrian re-identification method based on canonical correlation analysis fusion features
CN115147641A (en) Video classification method based on knowledge distillation and multi-mode fusion
Cai et al. TDCA-Net: Time-Domain Channel Attention Network for Depression Detection.
CN115471771A (en) Video time sequence action positioning method based on semantic level time sequence correlation modeling
CN116049557A (en) Educational resource recommendation method based on multi-mode pre-training model
CN117763446B (en) Multi-mode emotion recognition method and device
CN116701996A (en) Multi-modal emotion analysis method, system, equipment and medium based on multiple loss functions
Lu et al. Temporal Attentive Pooling for Acoustic Event Detection.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination