CN115393968A - Audio-visual event positioning method fusing self-supervision multi-mode features
- Publication number: CN115393968A
- Application number: CN202211032147.XA
- Authority: CN (China)
- Prior art keywords: visual, auditory, features, matrix, audio
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06V40/70 — Multimodal biometrics, e.g. combining information from different biometric modalities
- G06N3/02, G06N3/08 — Computing arrangements based on biological models; neural networks; learning methods
- G06V10/74 — Image or video pattern matching; proximity measures in feature spaces
- G06V10/764 — Recognition using pattern recognition or machine learning; classification, e.g. of video objects
- G06V10/80 — Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/82 — Recognition using pattern recognition or machine learning; neural networks
Abstract
The invention relates to an audio-visual event localization method fusing self-supervised multi-modal features, comprising the following steps: acquiring target video data and preprocessing it to obtain an image signal and a sound signal; inputting the image signal and the sound signal into an audio-visual event localization model for recognition and localization to obtain the event category at each moment in the target video data. The audio-visual event localization model comprises a visual-auditory feature extraction module, an audio-visual fusion module and a classification module connected in sequence, wherein the visual-auditory feature extraction module and the audio-visual fusion module are independent of each other. The invention improves the recognition accuracy of audiovisual events.
Description
Technical Field
The invention relates to the technical field of audiovisual event localization, and in particular to an audio-visual event localization method fusing self-supervised multi-modal features.
Background
Humans perceive the surrounding environment through a variety of senses (e.g., vision, hearing, touch, smell). While machine learning has made dramatic advances in single-modality tasks such as image classification (vision), speech recognition (hearing) and natural language processing (text), many tasks requiring the coordinated use of multiple modalities are now being proposed so that machines can better mimic the way humans perceive the real world. These multi-modal tasks are closer to practical application scenarios, and research on them is increasing.
Audiovisual event localization based on visual and auditory signals together is an important application in the field of multimodal video understanding and analysis. Given a video, the algorithm must determine when an audiovisual event of interest occurs and identify the type of event. Because an audiovisual event has both visual and auditory attributes, the information in the video's image and audio data can be fully exploited to localize the event more efficiently and accurately. On the one hand, although multimodal data provides more useful information about the video content than a single modality, the natural difference in how visual and auditory signals are composed makes it difficult to fuse information across modalities; the multimodal input also introduces more noise. On the other hand, when no audiovisual event of interest occurs, the algorithm must still process the incoming signal and correctly judge it as background. Such backgrounds often lack the distinctive visual and auditory characteristics of a specific event, making them harder to identify. To address these problems, existing methods generally integrate visual-auditory feature extraction and two-modality information fusion into a single end-to-end model, and focus mainly on designing specific attention modules to model cross-modal relationships during fusion. However, existing algorithms have the following main limitations:
(1) Information exchange between modalities during feature extraction is often neglected, which hinders sufficient information fusion later.
(2) During information fusion, cross-modal relationships are modeled only under certain special conditions, so the algorithms lack generality overall.
(3) Additional modules or manual supervision are relied upon to separately identify the background when no audiovisual event of interest occurs. These extra modules and manual supervision make the algorithms far less robust in practical applications.
Disclosure of Invention
The technical problem the invention aims to solve is to provide an audio-visual event localization method fusing self-supervised multi-modal features that improves the recognition accuracy of audiovisual events.
The technical solution adopted by the invention to solve this problem is as follows. The audio-visual event localization method fusing self-supervised multi-modal features comprises the following steps:
acquiring target video data and preprocessing it to obtain an image signal and a sound signal;
inputting the image signal and the sound signal into an audio-visual event localization model for recognition and localization to obtain the event category at each moment in the target video data.
The audio-visual event localization model comprises a visual-auditory feature extraction module, an audio-visual fusion module and a classification module connected in sequence; the visual-auditory feature extraction module and the audio-visual fusion module are independent of each other. The visual-auditory feature extraction module extracts spatio-temporal features from the image signal and the sound signal using a CNN and a Bi-LSTM, yielding visual features and auditory features. The audio-visual fusion module computes the similarity between asynchronous visual and auditory features based on cosine distance, corrects the similarity of feature pairs according to the rule that correlation decays over time, and then fuses the features. The classification module classifies the fused visual and auditory features to obtain the event category at each moment in the target video data.
The preprocessing of the target video data specifically comprises:
dividing the acquired target video data into a plurality of equal-length segments, each comprising synchronized image data and sound data;
randomly extracting one frame from each segment of image data and applying random cropping and Gaussian blur to it to obtain an image frame signal; converting each segment of sound data into a log-mel spectrum to obtain a sound spectrum signal;
arranging all image frame signals and sound spectrum signals in chronological order to obtain the image signal and the sound signal.
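As a rough illustration of the preprocessing steps above, the following NumPy-only sketch splits a video's frames and audio samples into T aligned segments and computes a simplified log-mel spectrum. The triangular filter bank here is a crude stand-in for a real mel filter bank, and all function names and parameter values are illustrative assumptions, not taken from the patent:

```python
import numpy as np

def split_into_segments(num_frames, num_samples, T=10):
    """Split aligned video frames and audio samples into T equal-length
    segment index ranges (the patent describes 1 s segments)."""
    frame_edges = np.linspace(0, num_frames, T + 1, dtype=int)
    sample_edges = np.linspace(0, num_samples, T + 1, dtype=int)
    frames = list(zip(frame_edges[:-1], frame_edges[1:]))
    samples = list(zip(sample_edges[:-1], sample_edges[1:]))
    return frames, samples

def log_mel_spectrogram(audio, n_fft=400, hop=160, n_mels=64):
    """Naive log-mel spectrum: windowed magnitude STFT followed by a
    crude triangular filter bank (illustration only, not a true mel scale)."""
    n_frames = 1 + (len(audio) - n_fft) // hop
    window = np.hanning(n_fft)
    frames = np.stack([audio[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, axis=1))      # (n_frames, n_fft//2 + 1)
    # linear-spaced triangular "mel" bank for illustration
    bank = np.maximum(0.0, 1.0 - np.abs(
        np.linspace(0, 1, mag.shape[1])[None, :] -
        np.linspace(0, 1, n_mels)[:, None]) * n_mels)
    mel = mag @ bank.T                             # (n_frames, n_mels)
    return np.log(mel + 1e-6)
```

In the real pipeline a pretrained audio CNN (VGGish in the experiments below) would consume such a spectrogram; this sketch only shows the shape of the data flow.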
The visual-auditory feature extraction module comprises a visual extraction part, an auditory extraction part, a visual projection layer, an auditory projection layer and a cross-correlation matrix unit. The visual extraction part and the auditory extraction part have the same structure, each comprising a CNN and a Bi-LSTM connected in sequence. The visual extraction part takes the image signal as input and extracts the visual features; the auditory extraction part takes the sound signal as input and extracts the auditory features. The visual projection layer maps the visual features to a higher-dimensional semantic space to obtain high-dimensional visual features; the auditory projection layer maps the auditory features to a higher-dimensional semantic space to obtain high-dimensional auditory features. The cross-correlation matrix unit computes the cross-correlation matrix between the high-dimensional visual features and the high-dimensional auditory features along the time dimension.
The loss function of the visual-auditory feature extraction module is:

$L_{AVBT} = \sum_i (1 - C_{ii})^2 + \lambda \sum_i \sum_{j \neq i} C_{ij}^2$

where $C_{ij}$ denotes the element in row $i$, column $j$ of the cross-correlation matrix, and $\lambda$ is a hyperparameter balancing the importance of the diagonal elements against the off-diagonal elements.
The audio-visual fusion module comprises a visual projection matrix part, an auditory projection matrix part, a cross-modal affinity matrix part, a visual update part, an auditory update part and an audio-visual fusion part. The visual projection matrix part applies a linear projection with learnable parameters to the visual features, and the auditory projection matrix part applies a linear projection with learnable parameters to the auditory features. The cross-modal affinity matrix part computes the affinity matrices of the projected visual and auditory features along the time dimension, and corrects them with weighting matrices. The visual update part updates the visual features based on the corrected affinity matrix, and the auditory update part updates the auditory features based on the corrected affinity matrix. The audio-visual fusion part fuses the updated visual and auditory features.
The cross-modal affinity matrix part computes the affinity matrices of the projected visual and auditory features through

$M_{av} = \frac{(W_a^1 f_a)(W_v^1 f_v)^T}{\sqrt{d}}, \qquad M_{va} = M_{av}^T$

and weights them element-wise through

$M'_{av} = \mathrm{softmax}(W_{av} \odot \mathrm{relu}(M_{av})), \qquad M'_{va} = \mathrm{softmax}(W_{va} \odot \mathrm{relu}(M_{va}))$

where $M_{av}$ and $M_{va}$ are the affinity matrices; $W_a^1$ is an auditory projection matrix of learnable parameters; $W_v^1$ is a visual projection matrix of learnable parameters; $d$ is the dimension of the projected feature vectors; $f_a$ is the auditory feature; $f_v$ is the visual feature; $M'_{av}$ and $M'_{va}$ are the corrected affinity matrices; $W_{av}$ and $W_{va}$ are the weighting matrices corresponding to $M_{av}$ and $M_{va}$ respectively; $\odot$ denotes the Hadamard product; softmax(·) denotes the softmax function; and relu(·) denotes the relu function.
The audio-visual fusion part completes the fusion of information between the modalities through

$f'_a = W_a^3\left(M'_{av}(W_a^2 f_a) + f_v\right), \qquad f'_v = W_v^3\left(M'_{va}(W_v^2 f_v) + f_a\right)$

where $W_a^2$ and $W_a^3$ are auditory projection matrices of learnable parameters, and $W_v^2$ and $W_v^3$ are visual projection matrices of learnable parameters.
Advantageous effects
Owing to this technical solution, the invention has the following advantages and positive effects compared with the prior art. During visual-auditory feature extraction, the invention realizes a preliminary exchange of information between the modalities and reduces the heterogeneity gap between visual and auditory information. During audio-visual information fusion, it reduces the dependence on additional manual supervision of the background, and further fuses cross-modal information by aggregating visual and auditory features carrying similar semantic information on top of the extracted features, thereby improving the recognition accuracy of audiovisual events.
Drawings
FIG. 1 is a flow chart of an embodiment of the present invention;
FIG. 2 is a schematic diagram of an audiovisual event localization model in an embodiment of the present invention;
FIG. 3 is a schematic diagram of a visual-auditory feature extraction module in an embodiment of the invention;
fig. 4 is a schematic diagram of an audiovisual fusion module in an embodiment of the invention.
Detailed Description
The invention will be further illustrated with reference to the following specific examples. It should be understood that these examples are for illustration only and are not intended to limit the scope of the invention. Furthermore, it should be understood that, after reading the teachings of the invention, those skilled in the art may make various changes or modifications to it, and such equivalent forms likewise fall within the scope defined by the appended claims.
The embodiment of the invention relates to an audio-visual event localization method fusing self-supervised multi-modal features which, as shown in fig. 1, comprises the following steps: acquiring target video data and preprocessing it to obtain an image signal and a sound signal; inputting the image signal and the sound signal into an audio-visual event localization model for recognition and localization to obtain the event category at each moment in the target video data.
The audio-visual event localization model comprises a visual-auditory feature extraction module, an audio-visual fusion module and a classification module connected in sequence; the visual-auditory feature extraction module and the audio-visual fusion module are independent of each other. The visual-auditory feature extraction module extracts spatio-temporal features from the image signal and the sound signal using a CNN and a Bi-LSTM, yielding visual features and auditory features. The audio-visual fusion module computes the similarity between asynchronous visual and auditory features based on cosine distance, corrects the similarity of feature pairs according to the rule that correlation decays over time, and then fuses the features. The classification module classifies the fused visual and auditory features to obtain the event category at each moment in the target video data.
During visual-auditory feature extraction, this embodiment realizes a preliminary exchange of information between the modalities and reduces the heterogeneity gap between visual and auditory information; during audio-visual information fusion, it reduces the dependence on additional manual supervision of the background and further fuses cross-modal information by aggregating similar visual and auditory features on top of the extracted features. As shown in fig. 2, the audiovisual event localization model in this embodiment is no longer end-to-end as a whole but is divided into two stages, feature extraction and information fusion, which mutually reinforce each other and greatly improve the efficiency of event localization.
In this embodiment, the visual-auditory feature extraction module comprises a visual extraction part, an auditory extraction part, a visual projection layer, an auditory projection layer and a cross-correlation matrix unit. The visual extraction part and the auditory extraction part have the same structure, each comprising a CNN and a Bi-LSTM connected in sequence. The visual extraction part takes the image signal as input and extracts the visual features; the auditory extraction part takes the sound signal as input and extracts the auditory features. The visual projection layer maps the visual features to a higher-dimensional semantic space to obtain high-dimensional visual features; the auditory projection layer maps the auditory features to a higher-dimensional semantic space to obtain high-dimensional auditory features. The cross-correlation matrix unit computes the cross-correlation matrix between the high-dimensional visual features and the high-dimensional auditory features along the time dimension.
In the first stage, the feature extraction module is trained in a self-supervised manner, as shown in fig. 3. First, an input video S is divided into T equal-length segments, each segment S_t containing synchronized image data V_t and sound data A_t. In the preprocessing stage of each training cycle, one frame is randomly extracted from each image segment V_t, and random cropping and Gaussian blur are applied to it to obtain an image frame signal frame_t; each sound segment A_t is converted into a log-mel spectrum to obtain a sound spectrum signal spec_t. All image frame signals frame_t and sound spectrum signals spec_t are arranged in chronological order to obtain the preprocessed results, i.e. the image signal frame and the sound signal spec, which are then sent to the feature extraction module. There, a CNN and a Bi-LSTM are used in sequence to extract the spatio-temporal features of the visual and auditory signals, giving the visual feature f_v and the auditory feature f_a. Two projection layers then map f_v and f_a to a higher-dimensional semantic space, giving the high-dimensional visual feature z_v and the high-dimensional auditory feature z_a.
Since synchronous visual and auditory signals are generated simultaneously by the same event, they can be regarded as two different expressions of that event and are naturally highly correlated. To enable the feature extraction module to learn high-quality audio-visual joint features, this embodiment proposes a new self-supervised loss function. To compute it, the cross-correlation matrix C between the high-dimensional visual feature z_v and the high-dimensional auditory feature z_a is first computed along the time dimension; its element in row i, column j is defined as:

$C_{ij} = \frac{\sum_t z^a_{t,i}\, z^v_{t,j}}{\sqrt{\sum_t (z^a_{t,i})^2}\,\sqrt{\sum_t (z^v_{t,j})^2}}$

where t indexes the audio-visual segments, and i and j index the corresponding element positions in the high-dimensional auditory feature z_a and the high-dimensional visual feature z_v. The difference between the cross-correlation matrix and the identity matrix (of the same dimensions) is then reduced, giving the loss function:

$L_{AVBT} = \sum_i (1 - C_{ii})^2 + \lambda \sum_i \sum_{j \neq i} C_{ij}^2$

where λ is a hyperparameter balancing the importance of the diagonal elements against the off-diagonal elements. By minimizing L_AVBT during training, the feature extraction module learns to extract high-quality features: improving the semantic similarity of the audio-visual features reduces the heterogeneity gap between the modalities, and information is preliminarily exchanged between the synchronous visual and auditory signals.
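The cross-correlation loss described here has the shape of a Barlow Twins-style objective computed over time. A minimal NumPy sketch under that reading follows; the function names and the exact normalization constant are assumptions, not taken from the patent:

```python
import numpy as np

def cross_correlation_matrix(z_a, z_v):
    """C[i, j]: cosine-style correlation between channel i of the auditory
    embedding and channel j of the visual embedding, computed over the
    time/segment dimension (rows of z_a, z_v are segments)."""
    num = z_a.T @ z_v
    den = (np.linalg.norm(z_a, axis=0)[:, None] *
           np.linalg.norm(z_v, axis=0)[None, :])
    return num / (den + 1e-8)

def avbt_loss(z_a, z_v, lam=5e-3):
    """Push C toward the identity: diagonal entries (matching channels of
    synchronous audio-visual pairs) toward 1, off-diagonal toward 0."""
    C = cross_correlation_matrix(z_a, z_v)
    on_diag = np.sum((1.0 - np.diag(C)) ** 2)
    off_diag = np.sum(C ** 2) - np.sum(np.diag(C) ** 2)
    return on_diag + lam * off_diag
```

With identical inputs the diagonal of C is 1 and the on-diagonal term vanishes, which matches the intuition that perfectly aligned modalities incur only the (small) redundancy penalty.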
It should be noted that after the first-stage training of the feature extraction module ends, the parameters of the module are frozen and used to extract fixed visual features f_v and auditory features f_a for the videos in the dataset, for subsequent cross-modal information fusion.
In this embodiment, the audio-visual fusion module comprises a visual projection matrix part, an auditory projection matrix part, a cross-modal affinity matrix part, a visual update part, an auditory update part and an audio-visual fusion part. The visual projection matrix part applies a linear projection with learnable parameters to the visual features, and the auditory projection matrix part applies a linear projection with learnable parameters to the auditory features. The cross-modal affinity matrix part computes the affinity matrices of the projected visual and auditory features along the time dimension and corrects them with weighting matrices. The visual update part updates the visual features based on the corrected affinity matrix, and the auditory update part updates the auditory features based on the corrected affinity matrix. The audio-visual fusion part fuses the updated visual and auditory features.
In the second stage, the audio-visual fusion module is trained on the visual and auditory features obtained in the first stage, under the supervision of the training-sample labels, to further fuse cross-modal information, as shown in fig. 4. The main purpose of this stage is to compute the similarity between asynchronous visual and auditory features based on cosine distance, correct the similarity of feature pairs according to the general rule that correlation decays over time (i.e., visual and auditory signals further apart in time are more weakly correlated), and then aggregate features with high similarity. Specifically, the visual features f_v and auditory features f_a extracted by the feature extraction module are taken as input, and after linear projection their affinity matrices are computed along the time dimension as follows:
$M_{av} = \frac{(W_a^1 f_a)(W_v^1 f_v)^T}{\sqrt{d}}, \qquad M_{va} = M_{av}^T$

where $W_a^1$ is an auditory projection matrix of learnable parameters, $W_v^1$ is a visual projection matrix of learnable parameters, and $d$ is the dimension of the projected feature vectors. $M_{av}$ and $M_{va}$ are then weighted element by element, using weights based on the time difference between the audio-visual feature pairs:

$M'_{av} = \mathrm{softmax}(W_{av} \odot \mathrm{relu}(M_{av})), \qquad M'_{va} = \mathrm{softmax}(W_{va} \odot \mathrm{relu}(M_{va}))$

where relu(·) filters out negatively correlated audio-visual feature pairs; softmax(·) normalizes each row of the matrix; $\odot$ denotes the Hadamard product, which applies a weight to each audio-visual feature pair; and $W_{av}$ and $W_{va}$ are the weighting matrices corresponding to $M_{av}$ and $M_{va}$ respectively. Taking $W_{av}$ as an example, its elements are computed as:

$W_{av}^{i'j'} = \exp(-\theta\,|i' - j'|)$

where i' and j' index the auditory feature at time i' and the visual feature at time j' respectively (the absolute difference between i' and j' is the time gap between the two features) and likewise locate the corresponding entry of $M_{av}$; exp(·) is the exponential function; and θ is a settable hyperparameter that balances the influence of the time gap between feature pairs against their similarity in the weighting operation, and directly affects the final effect of the audio-visual fusion module. The corrected affinity matrices are then used to update the features of the visual and auditory modalities respectively, and the features of the different modalities are added to complete the fusion of inter-modal information, computed as follows:
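A small NumPy sketch of the affinity computation and time-decay weighting just described. The scaled dot-product form of the affinity and the exponential-decay form of the weight are assumptions reconstructed from the text, and all names are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def weighted_affinity(f_a, f_v, W1_a, W1_v, theta=0.03):
    """Affinity between every (audio time i', visual time j') pair after
    linear projection, rectified to drop negatively correlated pairs,
    weighted by an exponential decay in the time gap |i' - j'|, then
    row-normalized with softmax."""
    pa, pv = f_a @ W1_a, f_v @ W1_v            # (T, d) projected features
    d = pa.shape[1]
    M_av = (pa @ pv.T) / np.sqrt(d)            # (T, T) affinity matrix
    T = M_av.shape[0]
    idx = np.arange(T)
    W_av = np.exp(-theta * np.abs(idx[:, None] - idx[None, :]))
    return softmax(W_av * np.maximum(M_av, 0.0), axis=1)
```

Row-normalization means each time step distributes a unit of attention over the other modality's time steps, biased toward temporally nearby segments.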
$f'_a = W_a^3\left(M'_{av}(W_a^2 f_a) + f_v\right)$

$f'_v = W_v^3\left(M'_{va}(W_v^2 f_v) + f_a\right)$

where $W_a^2$ and $W_a^3$ are auditory projection matrices of learnable parameters, and $W_v^2$ and $W_v^3$ are visual projection matrices of learnable parameters.
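The two update equations can be sketched in NumPy as follows, using a row-major (T, d) feature layout so the projection matrices multiply on the right (a transcription choice, not from the patent; the corrected affinity matrices Mp_av and Mp_va are taken as given inputs):

```python
import numpy as np

def fuse_modalities(f_a, f_v, Mp_av, Mp_va, W2_a, W3_a, W2_v, W3_v):
    """Update each modality by aggregating its own projected features with
    the corrected affinity matrix, adding the other modality's features,
    and projecting once more, mirroring the two update equations above."""
    f_a_new = (Mp_av @ (f_a @ W2_a) + f_v) @ W3_a
    f_v_new = (Mp_va @ (f_v @ W2_v) + f_a) @ W3_v
    return f_a_new, f_v_new
```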
After cross-modal information fusion is completed, the fused visual and auditory features are added together and sent to the classification module. In this embodiment, the classification module is a 3-layer perceptron, and the final classification result is computed as:
$p = \mathrm{MLP}(\mathrm{layernorm}(f'_a) + \mathrm{layernorm}(f'_v))$

where layernorm(·) is a layer normalization function. For a video divided into T segments, the output prediction p has dimension $\mathbb{R}^{T \times C}$, i.e., it indicates for each segment which type of event it belongs to, or whether it is background. When training the fusion module and the classification layer, the loss function is the cross entropy between the output prediction p and the ground-truth video label gt.
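A minimal NumPy sketch of this classification head: layer-normalize each fused modality, sum, then pass through a 3-layer perceptron. The hidden sizes and the class count C = 29 (28 event categories plus background, an illustrative reading of the text) are assumptions:

```python
import numpy as np

def layernorm(x, eps=1e-5):
    """Normalize each feature vector to zero mean and unit variance
    across its last dimension (no learnable scale/shift for brevity)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def classify(fa_fused, fv_fused, weights):
    """Per-segment class scores: layernorm each modality, add, then a
    3-layer MLP with ReLU between the linear layers (stand-in for the
    3-layer perceptron described in the text)."""
    h = layernorm(fa_fused) + layernorm(fv_fused)
    W1, W2, W3 = weights
    h = np.maximum(h @ W1, 0.0)
    h = np.maximum(h @ W2, 0.0)
    return h @ W3          # (T, C): one score vector per segment
```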
To verify the effectiveness of this embodiment, a model was built on the PyTorch platform and experiments were conducted on the AVE dataset.
Experimental data: the AVE dataset contains 4143 videos covering 28 categories of audiovisual events, including areas such as musical instrument performance, human behavior, vehicle activity and animal activity. The dataset is divided by default into three subsets: the training, validation and test sets contain 3339, 402 and 402 videos respectively. Each video is 10 seconds long; in the experiments, the pictures and sound of each video are divided into 10 consecutive equal-length 1 s segments for training the model. In the training phase (the first and second stages in fig. 2), the model's parameters are trained using only the training set, and the validation set is used to monitor overfitting while training the fusion module; at test time, the model runs inference on the test set.
Model evaluation: when localizing audiovisual events on the AVE dataset, the prediction accuracy over all segments of all test videos is used as the metric, following the evaluation of previous algorithms. Because the model in this embodiment is no longer end-to-end, the quality of the features learned by the self-supervised first-stage training is measured by classifying them directly with the classification layer, without passing through the second-stage audio-visual fusion module (i.e., the features are used for the audiovisual event localization task without further cross-modal fusion), and the resulting prediction accuracy serves as the quality measure of the self-supervised features. The second-stage audio-visual fusion module takes the self-supervised features learned in the first stage as input and is supervised directly with the dataset's ground-truth labels during training, so the classification predictions obtained after fusing the self-supervised features are more accurate, and the comparison with existing algorithms is also fairer.
Model details: the preprocessing section extracts 160 frames of 256x256 RGB images from each 10 s video at a frame rate of 16 fps; a fixed-length 10 s sound signal is extracted at a sampling rate of 16 kHz and divided into 10 segments of 1 s each, and each segment is passed through a short-time Fourier transform and a mel filterbank to obtain a log-mel spectrum of dimension 96x64. In the feature extraction module, the CNN processing the visual signal is a ResNet-50 pre-trained on ImageNet, and the CNN processing the log-mel spectrum is a VGGish pre-trained on AudioSet; the projection layers are 3 linear projection layers whose dimensions are 2048, 4096 and 4096 from input to output. In the audiovisual fusion module, θ in the weighting matrix is set to 0.03.
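The audio front end (STFT of each 1 s segment, then a mel filterbank, giving a 96x64 log-mel patch) can be sketched with plain numpy. The 25 ms Hann window and 10 ms hop are assumptions consistent with the standard VGGish front end; the text only specifies 16 kHz input and the 96x64 output:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel(signal, sr=16000, win=400, hop=160, n_mels=64, n_frames=96):
    # Short-time Fourier transform: 25 ms Hann window, 10 ms hop (assumed)
    window = np.hanning(win)
    frames = np.stack([signal[i * hop:i * hop + win] * window
                       for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, axis=1)) ** 2      # power spectrum (96, 201)
    # Triangular mel filterbank mapping the FFT bins to 64 mel bands
    mel_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2))
    bins = np.floor((win + 1) * mel_pts / sr).astype(int)
    fb = np.zeros((n_mels, win // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, c):
            fb[m - 1, k] = (k - lo) / max(c - lo, 1)
        for k in range(c, hi):
            fb[m - 1, k] = (hi - k) / max(hi - c, 1)
    return np.log(spec @ fb.T + 1e-6)                    # (96, 64) log-mel patch

x = np.random.randn(16000)  # 1 s of audio at 16 kHz
patch = log_mel(x)
print(patch.shape)  # (96, 64)
```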
Model training: the parameter λ of the loss function L_AVBT used when training the feature extraction module in the self-supervised manner is set to 5e-3; the model is trained with the LARS optimizer for 100 epochs (the first 10 epochs perform learning-rate warm-up, the remaining 90 follow cosine decay), the batch size is 128, and the base learning rate is 2e-4. When training the audiovisual fusion module, the initial learning rate is 1e-4 and is decayed to 0.98 times its value every 10 epochs; the batch size is 128 and the Adam optimizer is used.
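The claims mention a cross-correlation matrix between the projected visual and auditory features, and λ = 5e-3 matches the redundancy-reduction (Barlow Twins-style) objective; a plausible numpy sketch of such a loss L_AVBT follows. The exact form is an assumption, since the text does not spell the loss out:

```python
import numpy as np

def avbt_loss(z_v, z_a, lam=5e-3):
    """Hypothetical Barlow-Twins-style reading of L_AVBT: batch-standardize
    the visual and auditory embeddings, form their cross-correlation
    matrix, pull the diagonal toward 1 and decorrelate the rest."""
    n = z_v.shape[0]
    z_v = (z_v - z_v.mean(0)) / (z_v.std(0) + 1e-8)
    z_a = (z_a - z_a.mean(0)) / (z_a.std(0) + 1e-8)
    c = z_v.T @ z_a / n                                  # cross-correlation (d, d)
    on_diag = ((np.diag(c) - 1.0) ** 2).sum()            # align matched dimensions
    off_diag = (c ** 2).sum() - (np.diag(c) ** 2).sum()  # decorrelate the rest
    return on_diag + lam * off_diag
```

With perfectly aligned, already-decorrelated embeddings the loss is zero, which is the behavior the invariance/redundancy-reduction objective targets.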
Experimental results: in the self-supervised feature extraction part of the feature extraction module, the method is mainly compared with AVEL, the multi-modal feature extraction method most widely adopted in existing algorithms; it surpasses the existing method whether single-modal or multi-modal data is input, as shown in Table 1.
TABLE 1 Quality of the self-supervised features of the present method compared with the existing method

Method | Accuracy (%) |
---|---|
AVEL (audio input only) | 59.5 |
AVEL (picture input only) | 55.3 |
AVEL (audio-visual input) | 71.3 |
Present method (audio input only) | 63.5 |
Present method (picture input only) | 61.0 |
Present method (audio-visual input) | 75.6 |
In the information fusion part, starting from the self-supervised features of this method, cross-modal information is further fused by the audiovisual fusion module and the result is compared with existing algorithms; as shown in Table 2, the accuracy of this method is the highest.
TABLE 2 Comparison of the present method, after fusing the self-supervised features, with existing algorithms

Model | Accuracy (%) |
---|---|
AVSDN | 72.6 |
DMRN | 73.1 |
CMAN | 73.3 |
DAM | 74.5 |
AVRB | 74.8 |
AVIN | 75.2 |
MPN | 75.2 |
JCAN | 76.2 |
PSP | 76.6 |
AVT | 76.8 |
Present method | 77.2 |
In the process of extracting visual-auditory features, the invention realizes preliminary information exchange between the modalities and reduces the heterogeneity gap between visual and auditory information. In the process of audiovisual information fusion, it reduces the dependence on additional manual supervision of the background: on the basis of the extracted features, cross-modal information is further fused by aggregating visual and auditory features that carry similar semantic information, and these technical means improve the recognition accuracy of audiovisual events.
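The aggregation described above (affinity between asynchronous visual and auditory segments, attenuated with temporal distance before cross-modal aggregation) can be sketched in numpy. The scaled-dot-product affinity, the exp(-θ·|i-j|) decay with θ = 0.03 and the residual update are assumptions; the text only states that correlation is attenuated as the time gap grows:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse(f_v, f_a, W_v, W_a, theta=0.03):
    """Sketch of the fusion step: project both modalities, compute a
    (T, T) affinity between all pairs of segments, down-weight pairs
    that are far apart in time, then aggregate across modalities."""
    T = f_v.shape[0]
    p_v = np.maximum(f_v @ W_v, 0.0)                   # relu-projected visual features
    p_a = np.maximum(f_a @ W_a, 0.0)                   # relu-projected auditory features
    d = p_v.shape[1]
    M_av = softmax(p_v @ p_a.T / np.sqrt(d), axis=1)   # cross-modal affinity (T, T)
    idx = np.arange(T)
    W_decay = np.exp(-theta * np.abs(idx[:, None] - idx[None, :]))
    M_mod = W_decay * M_av                             # Hadamard-weighted affinity
    f_v_new = f_v + M_mod @ f_a                        # update visual with attended audio
    f_a_new = f_a + M_mod.T @ f_v                      # update audio with attended visual
    return f_v_new, f_a_new

rng = np.random.default_rng(0)
f_v, f_a = rng.normal(size=(10, 8)), rng.normal(size=(10, 8))  # 10 segments
W_v, W_a = rng.normal(size=(8, 16)), rng.normal(size=(8, 16))
f_v_new, f_a_new = fuse(f_v, f_a, W_v, W_a)
print(f_v_new.shape, f_a_new.shape)  # (10, 8) (10, 8)
```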
Claims (7)
1. An audio-visual event localization method fusing self-supervised multi-modal features, characterized by comprising the following steps:
acquiring target video data, and preprocessing the target video data to obtain an image signal and a sound signal; inputting the image signal and the sound signal into an audio-visual event positioning model for identification and positioning to obtain the event category of each moment in the target video data;
the audiovisual event positioning model comprises a visual-auditory feature extraction module, an audiovisual fusion module and a classification module which are sequentially connected; the visual-auditory feature extraction module and the audio-visual fusion module are mutually independent; the visual-auditory feature extraction module is used for respectively extracting space-time features of the image signal and the sound signal by using CNN and Bi-LSTM to obtain visual features and auditory features; the audio-visual fusion module calculates the similarity between asynchronous visual features and auditory features based on cosine distance, and corrects the similarity of feature pairs according to the law of correlation attenuation in time and then fuses the features; and the classification module classifies the video data based on the fused visual features and auditory features to obtain the event category of each moment in the target video data.
2. The audio-visual event localization method fusing self-supervised multi-modal features according to claim 1, wherein the preprocessing of the target video data specifically comprises:
dividing the acquired target video data into a plurality of equal-length segments, wherein each segment comprises synchronous image data and sound data;
randomly extracting one frame of picture from each segment of image data and applying random cropping and Gaussian blur to it to obtain an image frame signal; converting each segment of sound data into a log-mel spectrum to obtain a sound spectrum signal;
all the image frame signals and the sound spectrum signals are arranged according to the front-back sequence of time to obtain image signals and sound signals.
3. The audio-visual event localization method fusing self-supervised multi-modal features according to claim 1, wherein the visual-auditory feature extraction module comprises a visual extraction part, an auditory extraction part, a visual projection layer, an auditory projection layer and a cross-correlation matrix unit, wherein the visual extraction part has the same structure as the auditory extraction part and comprises a sequentially connected CNN and Bi-LSTM; the input of the visual extraction part is the image signal, used for extracting the visual features, and the input of the auditory extraction part is the sound signal, used for extracting the auditory features; the visual projection layer is used for mapping the visual features to a higher-dimensional semantic space to obtain high-dimensional visual features; the auditory projection layer is used for mapping the auditory features to a higher-dimensional semantic space to obtain high-dimensional auditory features; and the cross-correlation matrix unit is used for finding a cross-correlation matrix between the high-dimensional visual features and the high-dimensional auditory features in the time dimension.
5. The audio-visual event localization method fusing self-supervised multi-modal features according to claim 1, wherein the audiovisual fusion module comprises a visual projection matrix part, an auditory projection matrix part, a cross-modal affinity matrix part, a visual update part, an auditory update part and an audiovisual fusion part; the visual projection matrix part is used for performing a linear projection with learnable parameters on the visual features, and the auditory projection matrix part is used for performing a linear projection with learnable parameters on the auditory features; the cross-modal affinity matrix part is used for finding affinity matrices of the projected visual and auditory features in the time dimension and correcting the affinity matrices with a weighting matrix; the visual update part is used for updating the visual features based on the modified affinity matrix, and the auditory update part is used for updating the auditory features based on the modified affinity matrix; and the audiovisual fusion part is used for fusing the updated visual features and auditory features.
6. The audio-visual event localization method fusing self-supervised multi-modal features according to claim 5, wherein the cross-modal affinity matrix part finds the affinity matrices of the projected visual and auditory features by

M_av = softmax( relu(f_v · W_v^1) · relu(f_a · W_a^1)^T / √d ), M_va = M_av^T,

and corrects the affinity matrices by element-wise weighting

M'_av = W_av ⊙ M_av, M'_va = W_va ⊙ M_va,

wherein M_av and M_va are the affinity matrices, W_a^1 is the auditory projection matrix of learnable parameters, W_v^1 is the visual projection matrix of learnable parameters, d is the dimension of the feature vectors after projection, f_a is the auditory feature, f_v is the visual feature, M'_av and M'_va are the modified affinity matrices, W_av and W_va are the weighting matrices corresponding to the affinity matrices M_av and M_va respectively, ⊙ denotes the Hadamard product, softmax(·) denotes the softmax function, and relu(·) denotes the relu function.
7. The audio-visual event localization method fusing self-supervised multi-modal features according to claim 6, wherein the audiovisual fusion part completes the fusion of information between the modalities by

f_av = relu(f'_a · W_a^2) · W_a^3 + relu(f'_v · W_v^2) · W_v^3,

wherein f'_a and f'_v are the updated auditory and visual features, W_a^2 and W_a^3 are auditory projection matrices of learnable parameters, and W_v^2 and W_v^3 are visual projection matrices of learnable parameters.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211032147.XA CN115393968A (en) | 2022-08-26 | 2022-08-26 | Audio-visual event positioning method fusing self-supervision multi-mode features |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115393968A (en) | 2022-11-25 |
Family
ID=84121813
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211032147.XA Pending CN115393968A (en) | 2022-08-26 | 2022-08-26 | Audio-visual event positioning method fusing self-supervision multi-mode features |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115393968A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116310975A (en) * | 2023-03-14 | 2023-06-23 | Beijing University of Posts and Telecommunications | Audiovisual event positioning method based on consistent fragment selection |
CN117037046A (en) * | 2023-10-08 | 2023-11-10 | Zhejiang Lab | Audio-visual event detection method and device, storage medium and electronic equipment |
CN117037046B (en) * | 2023-10-08 | 2024-01-09 | Zhejiang Lab | Audio-visual event detection method and device, storage medium and electronic equipment |
CN117611255A (en) * | 2023-11-07 | 2024-02-27 | Beijing Chuangxinzhong Technology Co., Ltd. | Advertisement operation method and system based on big data |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Niu et al. | Multimodal spatiotemporal representation for automatic depression level detection | |
CN115393968A (en) | Audio-visual event positioning method fusing self-supervision multi-mode features | |
CN112818861B (en) | Emotion classification method and system based on multi-mode context semantic features | |
CN110033756B (en) | Language identification method and device, electronic equipment and storage medium | |
CN110516536A (en) | A kind of Weakly supervised video behavior detection method for activating figure complementary based on timing classification | |
CN111079594B (en) | Video action classification and identification method based on double-flow cooperative network | |
CN112487949B (en) | Learner behavior recognition method based on multi-mode data fusion | |
CN116171473A (en) | Bimodal relationship network for audio-visual event localization | |
CN114519809A (en) | Audio-visual video analysis device and method based on multi-scale semantic network | |
CN113723166A (en) | Content identification method and device, computer equipment and storage medium | |
CN110765854A (en) | Video motion recognition method | |
CN112541529A (en) | Expression and posture fusion bimodal teaching evaluation method, device and storage medium | |
CN109190521B (en) | Construction method and application of face recognition model based on knowledge purification | |
CN115862684A (en) | Audio-based depression state auxiliary detection method for dual-mode fusion type neural network | |
CN114220458B (en) | Voice recognition method and device based on array hydrophone | |
CN111986699A (en) | Sound event detection method based on full convolution network | |
CN111932056A (en) | Customer service quality scoring method and device, computer equipment and storage medium | |
CN110874576A (en) | Pedestrian re-identification method based on canonical correlation analysis fusion features | |
CN115147641A (en) | Video classification method based on knowledge distillation and multi-mode fusion | |
Cai et al. | TDCA-Net: Time-Domain Channel Attention Network for Depression Detection. | |
CN115471771A (en) | Video time sequence action positioning method based on semantic level time sequence correlation modeling | |
CN116049557A (en) | Educational resource recommendation method based on multi-mode pre-training model | |
CN117763446B (en) | Multi-mode emotion recognition method and device | |
CN116701996A (en) | Multi-modal emotion analysis method, system, equipment and medium based on multiple loss functions | |
Lu et al. | Temporal Attentive Pooling for Acoustic Event Detection. |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |