
Multi-modal joint learning method for video multi-behavior recognition

Info

Publication number
CN113807307A
CN113807307A (application CN202111143894.6A)
Authority
CN
China
Prior art keywords
behavior
modal
audio
visual
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111143894.6A
Other languages
Chinese (zh)
Other versions
CN113807307B (en)
Inventor
石珍生
郑海永
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ocean University of China
Original Assignee
Ocean University of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ocean University of China filed Critical Ocean University of China
Priority to CN202111143894.6A priority Critical patent/CN113807307B/en
Publication of CN113807307A publication Critical patent/CN113807307A/en
Application granted granted Critical
Publication of CN113807307B publication Critical patent/CN113807307B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the technical field of multi-behavior video recognition, and specifically discloses a multi-modal joint learning method for video multi-behavior recognition, comprising the following steps: S1, constructing a multi-modal joint learning network, wherein the multi-modal joint learning network comprises a visual modality learning module, an audio modality learning network and a text modality learning network; S2, preprocessing an original video dataset to obtain a corresponding visual frame dataset, an audio behavior feature dictionary and a text behavior feature dictionary; and S3, inputting the visual frame dataset into the visual modality learning module, the audio behavior feature dictionary into the audio modality learning network, and the text behavior feature dictionary into the text modality learning network for joint training, so as to output a multi-modal joint behavior prediction combining the three modalities of vision, audio and text. Ablation studies, multi-behavior relationship visualization and improvement analysis demonstrate the effectiveness of the multi-modal multi-behavior relationship modeling, and state-of-the-art performance is achieved on the large-scale multi-behavior benchmark dataset M-MiT.

Description

Multi-modal joint learning method for video multi-behavior recognition
Technical Field
The invention relates to the technical field of multi-behavior video recognition, and in particular to a multi-modal joint learning method for video multi-behavior recognition.
Background
Multi-behavior video recognition is more challenging because it requires identifying multiple behaviors that occur simultaneously or consecutively. Modeling multi-behavior relationships is beneficial and crucial for understanding videos with multiple behaviors, and the behaviors in a video are typically presented in the form of multiple modalities.
Video understanding is a very complex and comprehensive task in computer vision, as it aims to identify the activities occurring in complex environments from complex audiovisual videos. The activities described in a video are typically composed of several behaviors that may occur simultaneously or sequentially. For example, when a "show" behavior occurs, it is often accompanied by "applause" and "cheering" behaviors. Multi-behavior video recognition is the task of automatically recognizing all behaviors occurring simultaneously in a video. Although considerable progress has been made in behavior recognition, multi-behavior recognition still has considerable limitations. Beyond the single-behavior video recognition task, more and more work is exploring the relationships between behaviors and objects in videos. Therefore, in order to identify all behaviors occurring simultaneously in a video and thus better address the multi-behavior recognition problem, it is beneficial and crucial to explore the relationships among multiple behaviors, i.e., multi-behavior relationships.
Recent advances in multi-behavior video recognition have focused on training classifiers on hand-designed and extracted spatio-temporal features, or on designing three-dimensional convolutional neural network (3D-CNN) structures to learn high-resolution spatio-temporal representations for classification. However, previous studies have not specifically considered the relationships among multiple behaviors in a video. Furthermore, although multi-modal information has been used to analyze multi-behavior videos, it has only been used to extract features of the respective modalities (i.e., spatio-temporal and acoustic features of the visual and audio modalities) for fusion classification, rather than to explore multi-modal multi-behavior relationships for obtaining more discriminative representations. Therefore, how to fully utilize multi-modal information to better explore multi-behavior relationships is the key to multi-behavior video recognition.
Disclosure of Invention
The invention provides a multi-modal joint learning method for video multi-behavior recognition, which solves the technical problem of how to fully utilize multi-modal information for multi-behavior video recognition.
In order to solve this technical problem, the invention provides a multi-modal joint learning method for video multi-behavior recognition, comprising the following steps:
S1, constructing a multi-modal joint learning network, wherein the multi-modal joint learning network comprises a visual modality learning module, an audio modality learning network and a text modality learning network;
S2, preprocessing an original video dataset to obtain a corresponding visual frame dataset, an audio behavior feature dictionary and a text behavior feature dictionary;
and S3, inputting the visual frame dataset into the visual modality learning module, the audio behavior feature dictionary into the audio modality learning network, and the text behavior feature dictionary into the text modality learning network for joint training, so as to output a multi-modal joint behavior prediction combining the three modalities of vision, audio and text.
Further, the visual modality learning module comprises a visual feature extraction network and a visual modality learning network; in step S3, the learning process of the visual modality learning module specifically comprises the steps of:
S31, the visual feature extraction network performs feature extraction on the input visual frame dataset, generates spatio-temporal features and broadcasts them to the visual modality learning network as the node features of N behaviors;
and S32, the visual modality learning network enhances the node features of the N behaviors, then averages them over the behavior dimension, and outputs the visual modality behavior prediction.
Further, in step S3, the learning process of the audio modality learning network specifically comprises the steps of:
S33, the audio modality learning network extracts audio modality multi-behavior relationships from the input audio behavior feature dictionary;
and S34, the audio modality multi-behavior relationships are applied to the spatio-temporal features generated by the visual feature extraction network to output the audio modality assisted joint behavior prediction.
Further, in step S3, the learning process of the text modality learning network specifically comprises the steps of:
S35, the text modality learning network extracts text modality multi-behavior relationships from the input text behavior feature dictionary;
and S36, the text modality multi-behavior relationships are applied to the spatio-temporal features generated by the visual feature extraction network to output the text modality assisted joint behavior prediction.
Further, the visual modality learning network, the audio modality learning network and the text modality learning network all employ a relation graph convolutional neural network, which is expressed as:

$$H^{(l+1)}_{\zeta} = \sigma\!\left(\tilde{D}^{-\frac{1}{2}}\,\tilde{A}\,\tilde{D}^{-\frac{1}{2}}\,H^{(l)}_{\zeta}\,W^{(l)}\right)$$

where $\tilde{A} = A + I_N$ is the adjacency matrix of the multi-behavior undirected graph $\mathcal{G}$ with added self-connections, $I_N$ is the identity matrix, $\tilde{D}$ is the diagonal degree matrix of $\tilde{A}$, $\sigma(\cdot)$ denotes a non-linear activation function, $W^{(l)}$ is the trainable weight matrix of layer $l$, and $H^{(l)}_{\zeta}$ denotes the multi-behavior relationships of the $l$-th layer; $\zeta$ denotes the modality, where $\zeta = v$ denotes the visual modality, $\zeta = \alpha$ the audio modality, and $\zeta = \tau$ the text modality. The multi-behavior undirected graph $\mathcal{G}$ is defined as $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ is the set of nodes representing behaviors and $\mathcal{E}$ is the set of edges of co-occurring behaviors represented by a binary adjacency matrix $A \in \{0, 1\}^{N \times N}$.
Further, the conditional probability $\psi_{ij} = \psi(v_j \mid v_i)$ denotes the probability that behavior $v_j$ occurs when behavior $v_i$ occurs; $\psi_{ij}$ is computed from the number of occurrences of the behavior pair $\{v_j, v_i\}$ and the number of occurrences of behavior $v_i$ in the training set, and a threshold $t$ is then applied to binarize $\psi_{ij}$ as the initialization, i.e. $A_{ij} = 1$ if $\psi_{ij} > t$ and $A_{ij} = 0$ otherwise, thereby introducing the behavior co-occurrence probabilities as the binary adjacency matrix A.
Further, the model error for jointly training the multi-modal joint learning network is expressed as:

$$\mathcal{E} = \ell\left(Z,\ R\right)$$

where R denotes the actual observations, H denotes the visual feature extraction network, $G_v$, $G_\alpha$ and $G_\tau$ denote the visual modality learning network, the audio modality learning network and the text modality learning network respectively, $Z_v$ denotes the visual modality behavior prediction obtained by the visual feature extraction network in conjunction with the visual modality learning network, $Z_\alpha$ denotes the audio modality assisted joint behavior prediction obtained by the visual feature extraction network in conjunction with the audio modality learning network, $Z_\tau$ denotes the text modality assisted joint behavior prediction obtained by the visual feature extraction network in conjunction with the text modality learning network, Z denotes the multi-modal joint behavior prediction of the multi-modal joint learning network, and $\ell(\cdot,\cdot)$ denotes the loss function.
In the joint training process, the modality-specific relationship representations first receive the error gradients to update the weights of the three relation graph convolutional neural networks so as to minimize the loss, and the error is then propagated from the three relation graph convolutional neural networks to the visual feature extraction network through the shared spatio-temporal representation to adjust its weights accordingly, so that the multi-modal joint learning network can be trained in a joint learning manner across multiple modalities, the relation graph convolutional neural networks are forced to learn more accurate relationship predictions from the spatio-temporal features, and the visual feature extraction network is driven to model stronger and more relevant spatio-temporal features from the video.
Further, the final behavior prediction generated by the multi-modal joint learning network is expressed as:

$$Z = \overline{G_v(\tilde{X})} + X\,G_\alpha(X_\alpha) + X\,G_\tau(X_\tau)$$

where X denotes the dynamic spatio-temporal features output by the visual feature extraction network, $\tilde{X}$ denotes X broadcast along the feature dimension, $X_\alpha$ denotes the static audio behavior feature dictionary, $X_\tau$ denotes the static text behavior feature dictionary, $\overline{G_v(\tilde{X})}$ denotes the prediction of the visual modality learning network for the input $\tilde{X}$, averaged over the behavior dimension, $G_\alpha(X_\alpha)$ denotes the prediction of the audio modality learning network for the input $X_\alpha$, and $G_\tau(X_\tau)$ denotes the prediction of the text modality learning network for the input $X_\tau$.
Further, the audio behavior feature dictionary and the text behavior feature dictionary are each defined as a set L of pairs (f, s), where the form f is an embedded feature of finite dimension and the meaning s is the corresponding behavior in a given behavior set; a feature corresponding to multiple behaviors is called polysemous, and features belonging to one behavior are called synonyms; the audio and text feature dictionaries are denoted as sets $L_\alpha$ and $L_\tau$ respectively, where the audio and text embedding features $f_\alpha$ and $f_\tau$ are the corresponding forms and the behaviors s are the meanings.
The behavior features of the audio modality learning network and the text modality learning network are initialized by querying the corresponding dictionary: the node features are modeled by traversing all meanings and querying the forms of the synonyms from the dictionary, so that the audio modality learning network and the text modality learning network can infer the semantic relationships among all the behaviors modeled as node features.
Further, the visual modal learning network, the audio modal learning network and the text modal learning network all adopt a relational graph convolutional neural network with a two-layer structure.
The invention provides a multi-modal joint learning method for video multi-behavior recognition, in which visual, audio and text multi-modal GCNs are constructed based on the relation graph convolutional neural network (GCN), spatio-temporal features are learned by a visual feature extraction network (a 3D convolutional neural network, 3D-CNN), modality-specific behavior representations are input into the multi-modal GCNs as node features to explore modality-aware multi-behavior relationships, and the audio and text embeddings are queried from their respective feature dictionaries. Ablation studies, multi-behavior relationship visualization and improvement analysis all show the effectiveness of the multi-modal multi-behavior relationship modeling. In addition, the method achieves state-of-the-art performance on the large-scale multi-behavior benchmark dataset M-MiT.
Drawings
FIG. 1 is a block diagram of a multimodal joint learning network provided by an embodiment of the present invention;
FIG. 2 is a diagram of an example of a multi-behavior Grad-CAM visualization with concurrent actions provided by an embodiment of the present invention;
FIG. 3 is a diagram of an exemplary representation of feature variations and behavior prediction scores for multiple behavior relationships for layers of a GCN according to an embodiment of the present invention;
fig. 4 is a diagram illustrating effect enhancement of multi-modal multi-behavior GCNs and visual GCNs in different behavior categories according to an embodiment of the present invention.
Detailed Description
The embodiments of the present invention will be described in detail below with reference to the accompanying drawings, which are given solely for the purpose of illustration and are not to be construed as limitations of the invention, since many variations thereof are possible without departing from the spirit and scope of the invention.
Multimedia data is typically a carrier of multiple kinds of information; in video, for example, visual, auditory and textual information are often conveyed simultaneously. Therefore, multi-modal learning is gradually developing into a main approach to multimedia content analysis and understanding, and among the modalities, the visual modality is widely used for its rich characterization capability. In addition, multi-modal joint representations are generally considered to have stronger representation capability. Unlike existing video multi-modal learning, this example provides a new multi-modal joint learning method which, in accordance with observations of the real world, accurately identifies all behaviors in a video and explores a multi-behavior relationship model in the video.
Recently, GCN (relational graph convolutional neural network) has also been used to explore relationships in video due to its powerful relational modeling capability. But this example does not just discover relationships from video frames, but sets behaviors as graph network nodes to build a multi-modal multi-behavior GCN to explore multi-behavior relationships of particular modalities in the video. The example mainly designs a multi-modal joint learning network for multi-behavior video recognition according to the following three observations:
(1) visual frames are far more important than other modalities to humans' daily experience and way of understanding the world (more than 80% of the information transmitted to the brain is visual);
(2) sounds are determined by the attributes of behaviors and are informative, and humans can construct a sound-to-behavior mapping from the experience in their brains;
(3) the human brain can also associate behaviors with their language labels (meaning words) to create a text-to-behavior mapping.
In practice, the behaviors in a video first appear as visual spatial and temporal frames, they are strongly correlated with the synchronously recorded audio, and finally they are correlated with each other literally (through the label text). Therefore, leveraging this multi-modal information in the video (i.e., frames, audio and text) to explore multi-behavior relationships can greatly help in identifying multiple behaviors and understanding complex videos.
Based on this, this example provides a multimodal joint learning method for video multi-behavior recognition, specifically including the steps of:
S1, constructing the multi-modal joint learning network shown in Fig. 1, wherein the multi-modal joint learning network comprises a visual modality learning module, an audio modality learning network (α) and a text modality learning network (τ);
S2, preprocessing the original video dataset to obtain the corresponding visual frame dataset, audio behavior feature dictionary and text behavior feature dictionary;
S3, inputting the visual frame dataset into the visual modality learning module, the audio behavior feature dictionary into the audio modality learning network, and the text behavior feature dictionary into the text modality learning network for joint training, so as to output the visual modality behavior prediction $Z_v$, the audio modality assisted joint behavior prediction $Z_\alpha$, the text modality assisted joint behavior prediction $Z_\tau$, and the joint multi-modal behavior prediction Z.
Here, the order of steps S1 and S2 is not limited.
Specifically, the visual modality learning module comprises a visual feature extraction network (mainly comprising 3D-CNN) and a visual modality learning network (v). In this embodiment, the visual modality learning network (v), the audio modality learning network (α), and the text modality learning network (τ) all employ GCNs, referred to as visual GCNs, audio GCNs, and text GCNs, respectively, which are collectively referred to as GCNs in this example.
In step S3, the learning process of the visual modality learning module specifically includes the steps of:
S31, the visual feature extraction network performs feature extraction (i.e., spatio-temporal representation) on the input visual frame dataset and generates spatio-temporal features X, which are broadcast to the visual modality learning network as the node features of N behaviors;
S32, the visual modality learning network enhances the node features of the N behaviors, then averages them over the behavior dimension and outputs the visual modality behavior prediction $Z_v$.
The visual modality has a strong ability to characterize the behaviors in a video, and 3D-CNNs show powerful performance in parsing and representing the visual modality. This example therefore models the visual behavior features using 3D-CNN spatio-temporal features. In the visual modality, behaviors are streamed dynamically across multiple frames, and they are diverse and varied. In essence, the 3D-CNN learns to parse behaviors through continuous input frames and dynamic optimization of the spatio-temporal features, making them more discriminative and finally generating a strong visual behavior representation. These visual features imply the relationships among multiple behaviors and are therefore suitable as the behavior features of the visual GCN for further exploring relation-enhanced multi-behavior representations in the visual modality.
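As a minimal, non-authoritative sketch of the visual branch described above, the following PyTorch-style code broadcasts the C-dimensional spatio-temporal features of a 3D-CNN to N behavior nodes, enhances them with a visual GCN, and averages over the behavior dimension; the class and argument names (VisualBranch, backbone_3dcnn, visual_gcn) are illustrative assumptions rather than the patent's actual implementation.

```python
import torch
import torch.nn as nn

class VisualBranch(nn.Module):
    """Sketch: 3D-CNN spatio-temporal features -> visual GCN -> Z_v."""
    def __init__(self, backbone_3dcnn: nn.Module, visual_gcn: nn.Module, num_behaviors: int):
        super().__init__()
        self.backbone = backbone_3dcnn      # H: video clip -> features X of shape (B, C)
        self.gcn = visual_gcn               # G_v: (B, N, C) -> (B, N, C)
        self.num_behaviors = num_behaviors  # N

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        X = self.backbone(clips)                                      # (B, C) spatio-temporal features
        X_tilde = X.unsqueeze(1).expand(-1, self.num_behaviors, -1)   # broadcast to (B, N, C) node features
        enhanced = self.gcn(X_tilde)                                  # relation-enhanced node features
        Z_v = enhanced.mean(dim=1)                                    # average over the behavior dimension
        return Z_v                                                    # visual modality behavior prediction
```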
In step S3, the learning process of the audio modality learning network specifically includes the steps of:
S33, the audio modality learning network extracts audio modality multi-behavior relationships from the input audio behavior feature dictionary;
and S34, the audio modality multi-behavior relationships are applied to the spatio-temporal features generated by the visual feature extraction network to output the audio modality assisted joint behavior prediction.
In step S3, the learning process of the text modality learning network specifically includes the steps of:
S35, the text modality learning network extracts text modality multi-behavior relationships from the input text behavior feature dictionary;
and S36, the text modality multi-behavior relationships are applied to the spatio-temporal features generated by the visual feature extraction network to output the text modality assisted joint behavior prediction.
Here, the three networks of the visual modality learning network, the audio modality learning network, and the text modality learning network learn synchronously.
Because of their more naive characterization capabilities, the audio and text modalities are often used as an aid to the visual modality for identifying behaviors in video, but they still implicitly contain audio-behavior and text-behavior relationships. Thus, this example further enhances the recognized spatio-temporal features by exploiting the audio and text modalities, modeling their modality-specific behavior features for the audio GCN and the text GCN respectively so as to aggregate modality-specific multi-behavior relationships. For a multi-behavior video dataset, audio and behaviors form a many-to-many mapping, i.e., one audio clip may correspond to multiple behaviors and one behavior may correspond to multiple audio clips, while text labels and behaviors form a one-to-one mapping, i.e., one label carries the meaning of one behavior. Therefore, this example represents these two modalities by defining a many-to-many audio behavior feature dictionary and a one-to-one text behavior feature dictionary for the behavior features of the audio GCN and the text GCN respectively. This example uses the VGGish model and the GloVe model to represent all audio and text labels of the video dataset and builds the audio behavior feature dictionary and the text behavior feature dictionary in the form of audio and word embeddings, respectively.
The audio behavior feature dictionary and the text behavior feature dictionary are each defined as a set L of pairs (f, s), where the form f is an embedded feature of finite dimension and the meaning s is the corresponding behavior in a given behavior set; a feature corresponding to multiple behaviors is called polysemous, and features belonging to one behavior are called synonyms. The audio and text feature dictionaries are denoted as sets $L_\alpha$ and $L_\tau$ respectively, where the audio and text embedding features $f_\alpha$ and $f_\tau$ are the corresponding forms and the behaviors s are the meanings.
The behavior features of the audio modality learning network and the text modality learning network are initialized by querying the corresponding dictionaries: the node features are modeled by traversing all meanings and querying the forms of the synonyms from the dictionaries, so that the audio modality learning network and the text modality learning network can infer the semantic relationships among all the behaviors modeled as node features.
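The following is a hedged sketch of how such dictionary-based node features might be initialized; the helper name, the averaging of synonym features, and the assumption that each stored feature is a 1-D vector are all illustrative choices, not the patent's prescribed procedure.

```python
import numpy as np

def build_node_features(feature_dict: dict, behaviors: list, num_synonyms: int = 1) -> np.ndarray:
    """Sketch: initialize GCN node features by querying a behavior feature dictionary.

    feature_dict maps each behavior name (meaning s) to a list of embedded 1-D
    features (forms f), e.g. pooled VGGish audio embeddings or a GloVe vector.
    """
    node_features = []
    for behavior in behaviors:                                # traverse all meanings
        synonyms = feature_dict[behavior][:num_synonyms]      # query synonym forms from the dictionary
        node_features.append(np.mean(synonyms, axis=0))       # one node feature per behavior
    return np.stack(node_features)                            # shape (N, P) for audio or (N, Q) for text
```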
In this example, the visual modality learning network, the audio modality learning network and the text modality learning network all use the relation graph convolutional neural network (GCN), expressed in a multi-layer form with the layer-wise propagation rule:

$$H^{(l+1)}_{\zeta} = \sigma\!\left(\tilde{D}^{-\frac{1}{2}}\,\tilde{A}\,\tilde{D}^{-\frac{1}{2}}\,H^{(l)}_{\zeta}\,W^{(l)}\right) \qquad (1)$$

where $\tilde{A} = A + I_N$ is the adjacency matrix of the multi-behavior undirected graph $\mathcal{G}$ with added self-connections, $I_N$ is the identity matrix, $\tilde{D}$ is the diagonal degree matrix of $\tilde{A}$, $\sigma(\cdot)$ denotes a non-linear activation function, $W^{(l)}$ is the trainable weight matrix of layer $l$, and $H^{(l)}_{\zeta}$ denotes the multi-behavior relationships of the $l$-th layer; $\zeta$ denotes the modality, where $\zeta = v$ denotes the visual modality, $\zeta = \alpha$ the audio modality, and $\zeta = \tau$ the text modality. The multi-behavior undirected graph $\mathcal{G}$ is defined as $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ is the set of nodes representing behaviors and $\mathcal{E}$ is the set of edges of co-occurring behaviors represented by a binary adjacency matrix $A \in \{0, 1\}^{N \times N}$.
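A minimal PyTorch sketch of the two-layer relation GCN of equation (1) is given below; the choice of LeakyReLU as the non-linear activation and the class name RelationGCN are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class RelationGCN(nn.Module):
    """Sketch of the two-layer relation GCN in Eq. (1):
    H^(l+1) = sigma(D~^-1/2 A~ D~^-1/2 H^(l) W^(l))."""
    def __init__(self, adj: torch.Tensor, in_dim: int, hidden_dim: int, out_dim: int):
        super().__init__()
        A_tilde = adj + torch.eye(adj.size(0))                    # add self-connections I_N
        deg = A_tilde.sum(dim=1)
        D_inv_sqrt = torch.diag(deg.pow(-0.5))                    # D~^-1/2
        self.register_buffer("A_hat", D_inv_sqrt @ A_tilde @ D_inv_sqrt)  # normalized adjacency
        self.W0 = nn.Linear(in_dim, hidden_dim, bias=False)       # layer-0 trainable weight
        self.W1 = nn.Linear(hidden_dim, out_dim, bias=False)      # layer-1 trainable weight
        self.act = nn.LeakyReLU(0.2)                              # non-linear activation sigma (assumed)

    def forward(self, H0: torch.Tensor) -> torch.Tensor:
        H1 = self.act(self.A_hat @ self.W0(H0))                   # first propagation layer
        H2 = self.A_hat @ self.W1(H1)                             # second propagation layer
        return H2                                                 # relation-enhanced node features
```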
For the binary adjacency matrix A, this example uses the conditional probability $\psi_{ij} = \psi(v_j \mid v_i)$ to denote the probability that behavior $v_j$ occurs when behavior $v_i$ occurs; $\psi_{ij}$ is computed from the number of occurrences of the behavior pair $\{v_j, v_i\}$ and the number of occurrences of behavior $v_i$ in the training set, and a threshold t is then applied to binarize $\psi_{ij}$ as the initialization, i.e. $A_{ij} = 1$ if $\psi_{ij} > t$ and $A_{ij} = 0$ otherwise, thereby introducing the behavior co-occurrence probabilities as the binary adjacency matrix A.
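The construction of A from the training annotations could look like the following sketch, assuming multi-hot label vectors per video; the function name and the zeroing of the diagonal (self-connections are added later as $I_N$) are illustrative assumptions.

```python
import numpy as np

def build_adjacency(labels: np.ndarray, t: float = 0.4) -> np.ndarray:
    """Sketch: binary co-occurrence adjacency from multi-hot labels.

    labels: (num_videos, N) matrix with labels[k, i] = 1 if behavior i is
    annotated in video k.  psi_ij = P(v_j | v_i) is estimated from counts.
    """
    counts_i = labels.sum(axis=0)                        # occurrences of each behavior v_i
    counts_ij = labels.T @ labels                        # co-occurrences of pairs (v_i, v_j)
    psi = counts_ij / np.maximum(counts_i[:, None], 1)   # conditional probability psi_ij
    A = (psi > t).astype(np.float32)                     # binarize with threshold t
    np.fill_diagonal(A, 0.0)                             # self-connections are added later as I_N
    return A
```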
The multi-behavior GCN structure constructed by the embodiment can explore the relationship among multiple behaviors. In essence, multi-behavior GCNs affect each behavior by aggregating features of neighboring behaviors, thereby learning a new representation of the relationship of one behavior to other behaviors. In this way, multiple behavior relationships are progressively aggregated and propagated to multiple GCN layers based on input node characteristics. In fact, multiple behaviors in the video exist in a multi-modal manner, and therefore, in order to better explore the relationship among the multiple behaviors, it is beneficial and crucial to construct a multi-modal GCN to utilize different node features of the multi-modal.
Behaviors in video have representations in various modalities, i.e., visual, audio and text, which play different roles in representing the behaviors. Thus, this example constructs a multi-modal multi-behavior graph network from a video dataset with three modalities, and in this work a two-layer GCN structure ($l \in \{0, 1\}$ in equation (1)) is simply adopted for each modality, where the three modalities are visual ($\zeta = v$), audio ($\zeta = \alpha$) and text ($\zeta = \tau$). The spatio-temporal representation of the video contains the richest discriminative features for identifying behaviors, so this example uses the 3D-CNN to extract spatio-temporal features and inputs them into the graph nodes for relation-enhanced classification, obtaining the visual GCN. Unlike the visual modality, audio and text in video mainly assist in recognizing behaviors due to their naive characterization capabilities; the spatio-temporal features corresponding to behaviors are typically dynamic and diverse, while audio and text are relatively static. Thus, this example designs an audio behavior feature dictionary and a text behavior feature dictionary for the video dataset and treats them as graph node features for exploring multi-behavior relationships from the audio and text modalities to aid the visual modality, generating the audio GCN and the text GCN respectively.
Formally, for the visual modality, this example broadcasts the spatio-temporal features generated by the 3D-CNN, $X \in \mathbb{R}^{C}$ (C is the behavior dimension), to $\tilde{X} \in \mathbb{R}^{N \times C}$ as the node features of the N behaviors; the relation-enhanced features obtained after aggregation by the visual GCN, $G_v(\tilde{X})$, are then averaged over the behavior dimension to output the visual modality behavior prediction $Z_v$. For the audio modality, this example represents the dictionary audio embeddings $X_\alpha \in \mathbb{R}^{N \times P}$ (P is the audio dimension) as the graph behavior features, so that the multi-behavior relationships of the audio modality, $G_\alpha(X_\alpha)$, can be transferred from $X_\alpha$ within the audio GCN; the audio modality relationships are finally applied to the spatio-temporal features X to obtain the audio modality assisted joint behavior prediction $Z_\alpha$. Similarly, for the text modality, this example represents the dictionary text embeddings $X_\tau \in \mathbb{R}^{N \times Q}$ (Q is the text dimension) as the behavior features in the graph, so that the text GCN aggregates the text modality multi-behavior relationships $G_\tau(X_\tau)$ for the further text modality assisted joint behavior prediction $Z_\tau$.
For the overall model learning, this example has three modality-specific GCN models ($G_v$, $G_\alpha$, $G_\tau$) for relational reasoning and one visual-modality 3D-CNN model H for spatio-temporal representation learning, where the 3D-CNN shares its output spatio-temporal features X with the three GCNs for aggregating and propagating the multi-behavior relationships. The final behavior prediction is generated and compared with the actual behavior labels R (the actual observations) to obtain the model error calculated by the loss function, as follows:

$$\mathcal{E} = \ell\left(Z,\ R\right) \qquad (2)$$

where $Z_v$ denotes the visual modality behavior prediction obtained by the visual feature extraction network in conjunction with the visual modality learning network, $Z_\alpha$ denotes the audio modality assisted joint behavior prediction obtained by the visual feature extraction network in conjunction with the audio modality learning network, $Z_\tau$ denotes the text modality assisted joint behavior prediction obtained by the visual feature extraction network in conjunction with the text modality learning network, Z denotes the multi-modal joint behavior prediction of the multi-modal joint learning network, and $\ell(\cdot,\cdot)$ denotes the loss function.
During the joint training process, the modality-specific relationship representations first receive the error gradients to update the weights of the three relation graph convolutional neural networks so as to minimize the loss, and the error is then propagated from the three relation graph convolutional neural networks to the visual feature extraction network through the shared spatio-temporal representation to adjust its weights accordingly. In this way, the multi-modal joint learning network can be trained in a joint learning manner across multiple modalities: the relation graph convolutional neural networks are forced to learn more accurate relationship predictions from the spatio-temporal features, and the visual feature extraction network is driven to model stronger and more relevant spatio-temporal features from the video.
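A minimal training-step sketch illustrating this joint learning is shown below, assuming the model returns the fused prediction Z of equation (3); the binary cross-entropy loss and SGD settings follow the embodiment described later, while the function and variable names are assumptions.

```python
import torch
import torch.nn as nn

def train_step(model, clips, targets, optimizer):
    """One joint-training step: gradients flow into the three GCNs and, via the
    shared spatio-temporal features, back into the 3D-CNN."""
    criterion = nn.BCEWithLogitsLoss()          # multi-label binary cross-entropy
    optimizer.zero_grad()
    Z = model(clips)                            # fused multi-modal prediction, Eq. (3)
    loss = criterion(Z, targets)                # compare with ground-truth labels R
    loss.backward()                             # error propagates to GCNs and the shared 3D-CNN
    optimizer.step()
    return loss.item()

# optimizer covering both the 3D-CNN and the GCN parameters (settings from the embodiment):
# optimizer = torch.optim.SGD(model.parameters(), lr=0.05, momentum=0.9, weight_decay=1e-4)
```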
Since each modality has its own specific information and characterization capability, this example handles the different modalities with different methods. In particular, the dynamic spatio-temporal features X are the most influential for identifying behaviors from video and are therefore treated as the main information stream of model learning, while the static audio behavior feature dictionary and text behavior feature dictionary ($X_\alpha$ and $X_\tau$) usually assist behavior recognition and are therefore treated as auxiliary streams. As video frames are dynamically loaded into the 3D-CNN, the spatio-temporal representation is gradually learned, while the audio and text embeddings queried from the corresponding fixed dictionaries are simultaneously input into the modality-specific GCNs to serve as the auxiliary streams. Furthermore, this example combines the spatio-temporal representation with the audio and text multi-behavior relationships for the respective behavior predictions, and all three modality-specific behavior predictions are finally fused to produce the final behavior prediction Z, as follows:

$$Z = \overline{G_v(\tilde{X})} + X\,G_\alpha(X_\alpha) + X\,G_\tau(X_\tau) \qquad (3)$$

where $\tilde{X}$ denotes X broadcast along the feature dimension, $X_\alpha$ denotes the static audio behavior feature dictionary, $X_\tau$ denotes the static text behavior feature dictionary, $\overline{G_v(\tilde{X})}$ denotes the prediction of the visual modality learning network for the input $\tilde{X}$, averaged over the behavior dimension, $G_\alpha(X_\alpha)$ denotes the prediction of the audio modality learning network for the input $X_\alpha$, and $G_\tau(X_\tau)$ denotes the prediction of the text modality learning network for the input $X_\tau$. In this way, the information of the three modalities is combined to learn a better relational representation for identifying multiple behaviors.
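The fusion of equation (3) could be sketched as follows; since the text does not fully specify how the audio and text relation outputs are "applied to" X, a matrix product is assumed here, together with the assumption that the behavior dimension C equals the number of behaviors N.

```python
import torch

def fuse_predictions(X, G_v_out, G_alpha_out, G_tau_out):
    """Sketch of Eq. (3), assuming C == N (behavior/class dimension).

    X:           (C,)   spatio-temporal features from the 3D-CNN
    G_v_out:     (N, C) relation-enhanced visual node features G_v(X~)
    G_alpha_out: (N, C) audio multi-behavior relations G_alpha(X_alpha)
    G_tau_out:   (N, C) text multi-behavior relations G_tau(X_tau)
    """
    Z_v = G_v_out.mean(dim=0)                 # average over the behavior dimension
    Z_alpha = G_alpha_out @ X                 # apply audio relations to the spatio-temporal features
    Z_tau = G_tau_out @ X                     # apply text relations to the spatio-temporal features
    return Z_v + Z_alpha + Z_tau              # final multi-modal joint prediction Z
```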
In order to solve the challenging multi-behavior video recognition problem, this example proposes multi-modal GCNs that explore modality-specific multi-behavior relationships by exploiting the powerful relational representation capability of graph networks and the rich multi-modal information in video. Specifically, this example constructs a multi-behavior graph network with the behaviors as nodes and the co-occurrence probabilities of the behaviors as the adjacency matrix, and then constructs multi-modal GCNs for exploring modality-aware multi-behavior relationships, using modality-specific behavior representations as node features, i.e., the spatio-temporal features learned by the 3D-CNN and the audio and text embeddings queried from the respective feature dictionaries. Finally, the audio and text relationships are applied to the spatio-temporal features to generate the respective relational behavior predictions, which are further combined with the visual relational behavior prediction to produce the final prediction.
Experimental verification is performed below.
This example is mainly based on the recently released Multi-Moments in Time (M-MiT) dataset, which is regarded as a large-scale multi-behavior dataset for video understanding. M-MiT V1 contains 1.02 million three-second videos with a total of 2.01 million labels covering 313 behavior classes, annotated from a behavior vocabulary (e.g., skateboarding). In the training set, 553,535 videos are annotated with multiple behaviors, of which 257,491 videos are annotated with three or more behaviors. M-MiT V2 is an updated version of V1 with a revised behavior vocabulary; it includes 1.00 million videos and 292 behavior classes with a total of 1.92 million labels, and its training set includes 525,542 videos with multiple behavior annotations and 243,083 videos with three or more behavior annotations.
The task of multi-behavior video recognition is to recognize all the behaviors occurring in a video. However, in the M-MiT dataset nearly 50% of the videos are annotated with only one behavior. To better explore multi-behavior video recognition, this example builds a new dataset based on M-MiT in which every video is tagged with multiple behaviors while the integrity of the original categories is maintained. To this end, for the training set, this example first deletes videos without audio streams, then randomly selects 300 videos for each category containing over 300 videos and selects all videos for the remaining categories. In this way a "Mini M-MiT" training set is obtained, consisting of 93,206 videos covering 313 behavior categories. Compared with the original M-MiT dataset, the Mini M-MiT dataset accounts for only 10% of the data volume and is more suitable for rapid algorithm development and verification.
IG-65M is a very large pre-training dataset that includes videos generated by over 65 million public users of social media websites. Kinetics-400 is a classic benchmark for behavior recognition, with 246k training and 20k validation videos. This example uses R(2+1)D-34 as the 3D-CNN, taking the published IG-65M pre-trained model and fine-tuning it on Kinetics-400 (top-1 accuracy: 80.5).
The audio behavior feature dictionary is a set of behavior-indexed features consisting of the audio features corresponding to each behavior of the dataset. First, all silent audio in M-MiT is removed to ensure that all audio in the dictionary is valid. The VGGish network is then employed to extract features of size 3 × 128 from the selected audio. Because redundant information exists in the audio data, PCA whitening is further applied to post-process the extracted features. Finally, the audio features are stored by behavior category to obtain the audio behavior feature dictionary.
Similarly, the text behavior feature dictionary is a set of behavior-indexed word features that depends on the behavior vocabulary. This example uses the GloVe network to extract word embeddings for all behaviors in the M-MiT vocabulary, where each behavior corresponds to a feature vector of size 300, creating a text behavior feature dictionary containing the word vectors of all behaviors.
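A hedged sketch of how the two dictionaries might be assembled is given below; extract_vggish_features, apply_pca_whitening, glove_vectors and the video attributes are assumed helpers standing in for the VGGish/PCA and GloVe processing described above.

```python
import numpy as np

def build_audio_dictionary(videos, extract_vggish_features, apply_pca_whitening):
    """Sketch: many-to-many audio behavior feature dictionary {behavior: [embeddings]}."""
    dictionary = {}
    for video in videos:
        if video.has_silent_audio:                              # drop silent audio first
            continue
        feat = apply_pca_whitening(extract_vggish_features(video.audio_path))
        for behavior in video.behaviors:                        # one clip may cover several behaviors
            dictionary.setdefault(behavior, []).append(feat)
    return dictionary

def build_text_dictionary(behavior_vocabulary, glove_vectors):
    """Sketch: one-to-one text behavior feature dictionary using 300-d GloVe vectors."""
    return {behavior: glove_vectors[behavior] for behavior in behavior_vocabulary}
```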
This example performs data augmentation on the temporal and spatial scales: 8 consecutive frames are randomly sampled with a sampling stride of 2. The input frames are cropped by multi-scale random cropping and then resized to 112 × 112. The cropping window size is d × d, where d is the product of the length of the shorter input side and a scaling factor in [0.7, 0.875].
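The augmentation described above could be sketched as follows; PIL-like frame objects and a continuous scaling factor drawn from [0.7, 0.875] are assumptions made for illustration.

```python
import random

def sample_and_crop(frames, num_frames=8, stride=2, scales=(0.7, 0.875), out_size=112):
    """Sketch of the temporal/spatial augmentation: 8 frames with stride 2,
    multi-scale random crop, resize to 112 x 112. frames are PIL-like images."""
    span = (num_frames - 1) * stride + 1
    start = random.randint(0, max(len(frames) - span, 0))          # random temporal position
    clip = frames[start:start + span:stride]                       # 8 frames, stride 2

    short_side = min(clip[0].height, clip[0].width)
    d = int(short_side * random.uniform(*scales))                  # crop window d x d
    y = random.randint(0, clip[0].height - d)
    x = random.randint(0, clip[0].width - d)
    clip = [f.crop((x, y, x + d, y + d)).resize((out_size, out_size)) for f in clip]
    return clip
```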
In this embodiment, the constructed multi-modal joint learning network/model is trained and verified on 8 NVIDIA RTX 2080Ti GPUs; during training the mini-batch size is set to 8 per GPU (64 in total) and batch normalization is applied. For the Mini M-MiT dataset the training lasts 30 epochs in total with an initial learning rate of 0.05, decayed by a factor of 0.1 at epochs 12 and 24, and the first 3 epochs are used for learning-rate warm-up; for the complete M-MiT dataset the initial learning rate is set to 0.01 and no warm-up is needed. The network is trained with a binary cross-entropy loss optimized by SGD with a momentum of 0.9 and a weight decay of 0.0001. t is set to 0.4 to binarize the adjacency matrix A. All experiments are performed with PyTorch 1.3, and this example uses mixed-precision training.
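A sketch of the corresponding optimizer and learning-rate schedule (Mini M-MiT settings) is given below; the linear shape of the warm-up is an assumption, and the scheduler is assumed to be stepped once per epoch.

```python
import torch

def make_optimizer_and_scheduler(model, base_lr=0.05, warmup_epochs=3):
    """Sketch: SGD with momentum/weight decay, linear warm-up, step decay at epochs 12 and 24."""
    optimizer = torch.optim.SGD(model.parameters(), lr=base_lr,
                                momentum=0.9, weight_decay=1e-4)

    def lr_lambda(epoch):
        if epoch < warmup_epochs:                                 # linear warm-up (shape assumed)
            return (epoch + 1) / warmup_epochs
        return 0.1 ** sum(epoch >= m for m in (12, 24))           # decay by 0.1 at epochs 12 and 24

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```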
This example reports the mean average precision (mAP) and the top-1 and top-5 classification accuracies for all experiments, where mAP is regarded as the primary evaluation metric because it captures errors in the ranking of the behaviors related to a video. For each positive label, the proportion of related labels ranked above it is computed, and the results are then averaged over all labels. The top-1 and top-5 accuracies represent the percentage of test videos for which a positive label appears in the top-1 predicted category or in the top-5 predicted categories, respectively.
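A sketch of the per-video label-ranking mAP described above is given below; the exact mAP variant used in the experiments may differ, so this is illustrative only.

```python
import numpy as np

def mean_average_precision(scores: np.ndarray, targets: np.ndarray) -> float:
    """Sketch: for each positive label, the proportion of related labels ranked
    at or above it; averaged per video, then over videos.

    scores:  (num_videos, N) predicted behavior scores
    targets: (num_videos, N) multi-hot ground-truth labels
    """
    aps = []
    for s, t in zip(scores, targets):
        order = np.argsort(-s)                         # rank all labels for this video
        ranked_rel = t[order]                          # relevance of labels in ranked order
        pos_ranks = np.where(ranked_rel == 1)[0]
        if len(pos_ranks) == 0:
            continue
        precisions = np.arange(1, len(pos_ranks) + 1) / (pos_ranks + 1)
        aps.append(precisions.mean())
    return float(np.mean(aps))
```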
This example performs multi-clip testing to evaluate the model comprehensively, sampling temporal clips uniformly from each video and then cropping spatial regions from the frames of these clips. Specifically, 10 temporal clips are uniformly extracted from the entire video and 3 spatial crops (the two sides and the center) are used. Spatially fully-convolutional inference is performed, scaling the shorter side of each video frame to 128 while maintaining the aspect ratio. The final prediction takes the highest score over all clips (for mAP) and the average score (for top-1 and top-5).
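The multi-clip aggregation could be sketched as follows, assuming the 30 crops (10 temporal × 3 spatial) have already been extracted into a single batch; the function name is an illustrative assumption.

```python
import torch

@torch.no_grad()
def multi_clip_inference(model, clips):
    """Sketch: aggregate predictions over 10 temporal clips x 3 spatial crops.

    clips: tensor of shape (30, C, T, H, W) holding the pre-extracted crops."""
    scores = torch.sigmoid(model(clips))       # per-crop behavior probabilities, (30, N)
    score_max = scores.max(dim=0).values       # max over crops, used for mAP
    score_mean = scores.mean(dim=0)            # mean over crops, used for top-1 / top-5
    return score_max, score_mean
```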
In this example, ablation experiments are performed on the constructed Mini M-MiT dataset, using the pre-trained R(2+1)D-34 as the baseline model, to verify the effectiveness of multi-modal multi-behavior relationship modeling; the ablation starts from the baseline 3D-CNN model R(2+1)D. This model uses a fully connected (FC) layer as the classifier, has no GCN structure and involves only the visual modality. This example first uses the visual GCN to replace the FC layer of R(2+1)D, so that the spatio-temporal features are enhanced by exploring visual multi-behavior relationships for the final behavior prediction. Table 1 shows the results of combining different models and involving different modalities, indicating that the visual GCN model of this example outperforms the baseline 3D-CNN model in terms of mAP, top-1 and top-5; it can therefore be seen that the visual GCN of this example does positively impact the performance.
The 3D-CNN is then combined with the corresponding GCN (audio GCN or text GCN), adding an additional modality (audio or text) to the visual modality, resulting in two combined models that generate the audio- and text-assisted behavior predictions respectively; the results are shown in Table 1. It can be observed that by combining the modality-specific GCN with the additional modality, both the top-1 and top-5 accuracies are improved, while the mAP is significantly improved by more than 3%, indicating the effectiveness of the audio and text GCNs of this example in exploring valid multi-behavior relationships. In addition, this example also combines the visual GCN with the audio GCN or the text GCN to obtain the combined models {v, α} or {v, τ}, in which the behavior predictions of the two specific modalities are fused by removing one modality from equation (3); the results in Table 1 also show that they bring additional performance gains.
TABLE 1 ablation study of multimodal joint learning
[Table 1 is presented as an image in the original document.]
Further, the three modalities are combined to obtain a combined model containing the audio and text GCNs but not the visual GCN, and a combined model containing the GCNs of all three modalities. Table 1 shows that the former, which involves all modalities but no visual GCN, obtains results comparable (the same top-1 accuracy and mAP) to those of the two-modality combinations that include the visual GCN. This demonstrates the effectiveness of the visual multi-behavior relationships. Meanwhile, combining the 3D-CNN with the GCNs of all three specific modalities to explore multi-modal multi-behavior relationships obtains the highest mAP, which proves the effectiveness of the multi-modal joint learning of this example. It is worth noting that the multi-modal GCNs of this example provide significant improvements at only a small parameter cost; for example, two of the combined models of this example increase the mAP by 3.2% and 3.4% over the baseline 3D-CNN while introducing only 0.76M and 0.67M additional parameters. In addition, this example also tried different 3D-CNNs (R3D-18 and I3D-50) in the above models and obtained effective results (mAP%): R3D-18 (45.8, 49.1, 49.5, 50.7) and I3D-50 (53.1, 55.6, 55.8, 57.3).
Further study in this example finds that the method significantly improves the mAP by about 3% when using two modalities compared with one modality, mainly because of the introduction of the additional modality and the multi-modal joint learning designed in this example, whereas it provides only a slight performance improvement when using three modalities compared with two. This is believed to be because the naive characterization capability of the auxiliary modalities (audio or text) leaves less additional multi-behavior relationship to be explored under the same characterization mechanism (i.e., GCN and multi-modal joint learning).
For the audio behavior feature dictionary, this example traverses all behaviors to obtain the synonym features of each behavior to initialize the node features of the audio GCN, so this example analyzes how many synonym features should be obtained for one behavior. To this end, ablation experiments were performed with the number of synonym features (#f) set to 1, 2 and 3. The results shown in Table 2(a) indicate that a behavior can be represented by many different audio clips due to their natural many-to-many mapping, but it is preferable to select only one audio feature to represent a behavior for the audio GCN.
TABLE 2 Audio and text dictionary ablation study
[Table 2 is presented as an image in the original document.]
For the text behavior feature dictionary, since behaviors usually have a one-to-one mapping with their text labels (from the behavior vocabulary), this example studies whether different word-embedding methods matter. This example constructs the text feature dictionary using GloVe and BERT respectively, with 300- or 768-dimensional vectors representing each behavior. Table 2 shows that the accuracy of the behavior prediction is almost the same regardless of whether GloVe or BERT is used. Furthermore, compared with the audio combined model, the text combined model achieves similar performance, and the two modalities play a similar role in assisting the recognition of multiple behaviors.
This example also merges the audio and text modalities into one audio-text modality, providing audio-text behavior representations for an audio-text GCN by combining the audio and text dictionaries. The results in Table 2 show the effectiveness of the combined audio-text modality, whose performance is in fact similar to that of the corresponding audio-plus-text combination in Table 1. This example considers that the merged audio-text GCN actually attempts to explore the audio and text multi-behavior relationships simultaneously in one larger model, thereby achieving performance similar to that of the two separate smaller audio and text GCNs.
This example visualizes the attention learned by the 3D-CNN using gradient-weighted class activation mapping (Grad-CAM) to localize the behaviors occurring in the video. Fig. 2 shows the large difference between the 3D-CNN learned with the baseline model and the 3D-CNN learned with the multi-modal combined model of this example, indicating that the multi-modal joint learning of this example leads to better-optimized 3D-CNN training; the main difference is that the model of this example can localize the multiple behaviors presented in each scene. Taking the first row as an example, the attention of the baseline-trained model covers only the "swimming" and "wet" regions, whereas the model of this example attends not only to the "swimming" and "wet" regions but also to "submerging" and "diving"; similar phenomena can be found in the other examples. This example considers that, owing to the joint learning manner of the model, the 3D-CNN benefits well from the multi-modal GCN models and obtains the back-propagated error through the shared spatio-temporal representation, thereby generating stronger and more effective spatio-temporal relationship features to better explore the modality-specific multi-behavior relationships in the video.
This example further attempts to demonstrate the multi-behavior relationships learned by this example. Figs. 3(a), (b) and (c) show the feature changes across the GCN layers through t-distributed stochastic neighbor embedding (t-SNE) visualization, showing that the target behaviors (shaded numbers) gradually aggregate as they pass through the GCN layers, which demonstrates the ability to correlate multiple behaviors. Figs. 3(d) and (e) show the behavior prediction scores of the baseline and of the model of this example, showing that the model of this example can promote multiple target behaviors and suppress non-target behaviors, thereby proving the effectiveness of the latent multi-behavior relationship exploration.
Fig. 4 shows the performance improvement on different behavior categories brought by the multi-modal multi-behavior GCNs and the visual GCN listed in Table 1 of this example. The mAP improvement rate of a model is represented by dividing the difference in mAP between models by the mAP of the target model. This example shows that:
(1) the visual GCN model brings a modest improvement over the baseline, and the improvement is mainly embodied in categories having visual multi-behavior relationships, such as "child talking" (child + talking), "brow" (hearing) and "crying";
(2) the visual-audio model brings significant performance gains over the visual GCN model in categories with audio multi-behavior relationships, e.g., simultaneous "shaking" behaviors can be related through audio;
(3) the visual-text model also helps to identify multiple behaviors with associated literal meanings, such as "open", "close" and "lock";
(4) the audio-text model combines the audio and text multi-behavior relationships, thereby bringing a remarkable improvement;
(5) the full three-modality model improves the performance by integrating the advantages of all three modality-specific multi-behavior relationships, resulting in the highest mAP (see Table 1 of this example).
Table 3 shows a comparison with the state-of-the-art methods on the M-MiT dataset; the model of this example performs best on V1. Since V2 was only released in October 2020, no published comparison results are available, but this example still provides its results for reference. The best three-modality model of this example, using a lighter-weight backbone network, improves the mAP by about 3% over the M-MiT baseline. The M-MiT baseline used SoundNet with a wLSEP loss and behavior-label statistics for audio feature learning, whereas the visual-audio ({v, α}) model of this example performs 2.2% higher in mAP. Another recent work, TIN, reported only the mAP (62.2) on M-MiT (and is therefore not listed in the table), which is lower than that of the method of this example. In fact, the potential of the scheme of this example can be further exploited by using a more powerful 3D-CNN or sampling more input frames; for example, extending 8 frames to 16 frames yields a 0.9% mAP improvement on M-MiT V1.
TABLE 3 comparison of M-MiT V1 and V2
[Table 3 is presented as an image in the original document.]
Furthermore, in this work this example attempts to propose a new approach to multi-modal multi-behavior video understanding, and the newly published M-MiT datasets (V1 in 2019 and V2 in 2020) are well-suited benchmark datasets for this study, since they involve multiple modalities and multiple behaviors as well as their mutual cross-references (e.g., "play music", "drum beat" and "dance"). In addition, this example evaluates the model on the Charades dataset, whose labeling takes little account of audio multi-behavior cross-references (similar to MultiTHUMOS); there this example attempts to combine only the visual and text modalities and still obtains a 2% mAP improvement over the baseline 3D-CNN model.
In summary, the multi-modal multi-behavior relationships in video are explored by utilizing relation GCNs and the multi-modal nature of video. Ablation studies, multi-behavior relationship visualization and improvement analysis all validate the multi-modal multi-behavior GCNs and the multi-modal joint learning of this example, owing to their powerful multi-behavior relationship modeling capability. The method of this example achieves state-of-the-art performance on the latest large-scale multi-behavior M-MiT benchmark dataset.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims (10)

1. A multi-modal joint learning method for video multi-behavior recognition is characterized by comprising the following steps:
s1, constructing a multi-mode joint learning network, wherein the multi-mode joint learning network comprises a visual mode learning module, an audio mode learning network and a text mode learning network;
s2, preprocessing the original video data set to obtain a corresponding visual frame data set, an audio behavior feature dictionary and a text behavior feature dictionary;
and S3, inputting the visual frame data set into a visual modal learning module, inputting the audio behavior feature dictionary into an audio modal learning network, and inputting the text behavior feature dictionary into a text modal learning network for joint training to output multi-modal joint behavior prediction of three modes of joint vision, audio and text.
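By way of a non-limiting illustration of the network constructed in step S1, the following Python (PyTorch) sketch assembles a container module holding the three claimed branches; the class name, the tensor dimensions, and the use of identity/linear layers as stand-ins for the 3D-CNN backbone and the three relational graph convolutional networks are assumptions of this illustration, not part of the claimed method.

import torch.nn as nn

class MultiModalJointNet(nn.Module):
    # Illustrative container for the three branches named in step S1; every layer is a stub.
    def __init__(self, n_behaviors=292, feat_dim=1024, dict_dim=128):
        super().__init__()
        self.visual_backbone = nn.Identity()                 # visual feature extraction network H (stub)
        self.visual_gcn = nn.Linear(feat_dim, n_behaviors)   # visual modality learning network G_v (stub)
        self.audio_gcn = nn.Linear(dict_dim, feat_dim)       # audio modality learning network G_alpha (stub)
        self.text_gcn = nn.Linear(dict_dim, feat_dim)        # text modality learning network G_tau (stub)

net = MultiModalJointNet()
print(net)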
2. The multi-modal joint learning method for video multi-behavior recognition according to claim 1, wherein the visual modality learning module comprises a visual feature extraction network and a visual modality learning network; in step S3, the learning process of the visual modality learning module specifically comprises the steps of:
S31, the visual feature extraction network performs feature extraction on the input visual frame data set, generates spatio-temporal features, and broadcasts the spatio-temporal features to the visual modality learning network as node features of N behaviors; and
S32, the visual modality learning network enhances the node features of the N behaviors, averages the enhanced features over the behavior dimension, and outputs the visual modality behavior prediction.
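The following minimal sketch illustrates steps S31-S32 under illustrative assumptions: hypothetical dimensions, a linear layer standing in for the visual modality learning network G_v, and a random tensor standing in for the pooled spatio-temporal feature of the 3D-CNN backbone.

import torch
import torch.nn as nn

B, N, C = 2, 292, 1024                        # batch size, number of behaviors, feature dim (illustrative)
g_v = nn.Linear(C, N)                         # stand-in for the visual modality learning network G_v

clip_feature = torch.randn(B, C)              # pooled spatio-temporal feature from the 3D-CNN backbone H
x_tilde = clip_feature.unsqueeze(1).expand(-1, N, -1)   # S31: broadcast as node features of N behaviors
enhanced = g_v(x_tilde)                       # S32: enhance the node features -> (B, N, N)
y_v = enhanced.mean(dim=1)                    # S32: average over the behavior dimension
print(y_v.shape)                              # torch.Size([2, 292]) visual modality behavior prediction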
3. The multi-modal joint learning method for video multi-behavior recognition according to claim 2, wherein in step S3, the learning process of the audio modality learning network specifically comprises the steps of:
S33, the audio modality learning network extracts an audio modality multi-behavior relation from the input audio behavior feature dictionary; and
S34, the audio modality multi-behavior relation is applied to the spatio-temporal features generated by the visual feature extraction network, and an audio modality assisted joint behavior prediction is output.
4. The multi-modal joint learning method for video multi-behavior recognition according to claim 3, wherein in step S3, the learning process of the text modality learning network specifically comprises the steps of:
S35, the text modality learning network extracts a text modality multi-behavior relation from the input text behavior feature dictionary; and
S36, the text modality multi-behavior relation is applied to the spatio-temporal features generated by the visual feature extraction network, and a text modality assisted joint behavior prediction is output.
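Steps S33-S36 are symmetric for the audio and text branches; the sketch below illustrates the audio case and assumes, purely as an interpretation, that "applying the relation to the spatio-temporal features" is realized as a matrix product between the GCN output and the visual feature. All names and dimensions are hypothetical.

import torch
import torch.nn as nn

B, N, C, D = 2, 292, 1024, 128                # batch, behaviors, visual feature dim, dictionary embedding dim
g_alpha = nn.Linear(D, C)                     # stand-in for the audio modality learning network G_alpha

x = torch.randn(B, C)                         # spatio-temporal feature X from the visual backbone
x_audio_dict = torch.randn(N, D)              # static audio behavior feature dictionary X_alpha

relation = g_alpha(x_audio_dict)              # S33: audio modality multi-behavior relation, shape (N, C)
y_va = x @ relation.t()                       # S34: apply the relation to X (matrix product, an assumption)
print(y_va.shape)                             # torch.Size([2, 292]) audio modality assisted joint prediction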
5. The multi-modal joint learning method for video multi-behavior recognition according to claim 4, wherein the visual modality learning network, the audio modality learning network and the text modality learning network each employ a relational graph convolutional neural network represented as:
H_\zeta^{(l+1)} = \sigma\big( \tilde{D}_\zeta^{-1/2} \, \tilde{A}_\zeta \, \tilde{D}_\zeta^{-1/2} \, H_\zeta^{(l)} \, W_\zeta^{(l)} \big)
wherein \tilde{A}_\zeta = A_\zeta + I_N is the adjacency matrix of the multi-behavior undirected graph \mathcal{G}_\zeta with added self-connections, I_N being the identity matrix; \tilde{D}_\zeta is the diagonal degree matrix of \tilde{A}_\zeta, with (\tilde{D}_\zeta)_{ii} = \sum_j (\tilde{A}_\zeta)_{ij}; \sigma(\cdot) denotes a nonlinear activation function; W_\zeta^{(l)} is the trainable weight matrix of the l-th layer; H_\zeta^{(l)} denotes the behavior node representations at the l-th layer; \zeta denotes the modality, being the visual modality when \zeta = v, the audio modality when \zeta = \alpha, and the text modality when \zeta = \tau; the multi-behavior undirected graph \mathcal{G}_\zeta is defined as \mathcal{G}_\zeta = (\mathcal{V}, \mathcal{E}), wherein \mathcal{V} = \{v_i\}_{i=1}^{N} is the set of nodes representing behaviors, and \mathcal{E} is the set of edges of co-occurring behaviors represented by a binary adjacency matrix A \in \{0, 1\}^{N \times N}.
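A minimal sketch of the propagation rule reconstructed above, with hypothetical sizes and a randomly generated toy co-occurrence adjacency matrix:

import torch

def gcn_layer(H, A, W, act=torch.relu):
    # One propagation step: H_{l+1} = act(D~^{-1/2} (A + I) D~^{-1/2} H_l W_l)
    N = A.size(0)
    A_tilde = A + torch.eye(N)                           # add self-connections I_N
    d_inv_sqrt = A_tilde.sum(dim=1).pow(-0.5)            # diagonal of D~^{-1/2}
    A_hat = d_inv_sqrt.unsqueeze(1) * A_tilde * d_inv_sqrt.unsqueeze(0)
    return act(A_hat @ H @ W)

N, C_in, C_out = 292, 128, 64
A = (torch.rand(N, N) > 0.95).float()
A = ((A + A.t()) > 0).float()                            # symmetric binary co-occurrence adjacency (toy)
H0 = torch.randn(N, C_in)                                # node features (dictionary or broadcast visual feature)
W0 = 0.01 * torch.randn(C_in, C_out)                     # trainable weight matrix of this layer
print(gcn_layer(H0, A, W0).shape)                        # torch.Size([292, 64])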
6. The multi-modal joint learning method for video multi-behavior recognition according to claim 5, wherein: a conditional probability ψ_ij = ψ(v_j | v_i) is used to represent the probability that behavior v_j occurs when behavior v_i occurs; ψ_ij is calculated from the number of occurrences of the behavior pair (v_j, v_i) and the number of occurrences of behavior v_i in the training set; a threshold t is then set to binarize ψ_ij as an initialization, i.e., A_ij = 1 if ψ_ij > t and A_ij = 0 otherwise, thereby introducing the behavior co-occurrence probabilities as the binary adjacency matrix A.
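The thresholded co-occurrence initialization of claim 6 can be sketched as follows; the threshold value t = 0.4 and the toy annotation matrix are assumptions made only for illustration.

import numpy as np

def build_adjacency(labels, t=0.4):
    # labels: (num_videos, N) binary multi-behavior annotations of the training set.
    # Returns A with A[i, j] = 1 iff psi_ij = P(v_j occurs | v_i occurs) > t.
    labels = labels.astype(np.float64)
    counts_i = labels.sum(axis=0)                        # occurrences of each behavior v_i
    co_counts = labels.T @ labels                        # co-occurrences of behavior pairs (v_i, v_j)
    psi = co_counts / np.maximum(counts_i[:, None], 1.0) # conditional co-occurrence probabilities
    np.fill_diagonal(psi, 0.0)                           # ignore trivial self pairs
    return (psi > t).astype(np.float32)

labels = (np.random.rand(1000, 292) > 0.97).astype(np.float32)   # toy annotations
A = build_adjacency(labels, t=0.4)
print(A.shape, int(A.sum()))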
7. The multi-modal joint learning method for video multi-behavior recognition according to claim 6, wherein the model error for jointly training the multi-modal joint learning network is expressed as:
E = \ell\big(\hat{Y}^{v}, R\big) + \ell\big(\hat{Y}^{v,\alpha}, R\big) + \ell\big(\hat{Y}^{v,\tau}, R\big) + \ell\big(\hat{Y}^{v,\alpha,\tau}, R\big)
wherein R denotes the actual observed values; H denotes the visual feature extraction network; G_v, G_\alpha and G_\tau denote the visual modality learning network, the audio modality learning network and the text modality learning network, respectively; \hat{Y}^{v} denotes the visual modality behavior prediction obtained by the visual feature extraction network in conjunction with the visual modality learning network; \hat{Y}^{v,\alpha} denotes the audio modality assisted joint behavior prediction obtained by the visual feature extraction network in conjunction with the audio modality learning network; \hat{Y}^{v,\tau} denotes the text modality assisted joint behavior prediction obtained by the visual feature extraction network in conjunction with the text modality learning network; \hat{Y}^{v,\alpha,\tau} denotes the multi-modal joint behavior prediction of the multi-modal joint learning network; and \ell(\cdot,\cdot) denotes the loss function;
in the joint training process, the modality-specific relation representations first receive the error gradients, which update the weights of the three relational graph convolutional neural networks so as to minimize the loss; the errors are then propagated from the three relational graph convolutional neural networks, through the shared spatio-temporal representation, back to the visual feature extraction network to adjust its weights accordingly; in this way, the multi-modal joint learning network is trained through multiple modalities in a joint learning manner, which forces the relational graph convolutional neural networks to learn more accurate relation predictions from the spatio-temporal features and the visual feature extraction network to model stronger and more relevant spatio-temporal features from the video.
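A hedged sketch of the joint training error as a sum of per-branch losses; the use of binary cross-entropy with logits is an assumption, since the claim only specifies a generic loss function \ell.

import torch
import torch.nn.functional as F

def joint_loss(y_v, y_va, y_vt, y_vat, targets):
    # Sum of the per-branch losses of the four predictions against the observed labels R.
    loss = 0.0
    for pred in (y_v, y_va, y_vt, y_vat):
        loss = loss + F.binary_cross_entropy_with_logits(pred, targets)
    return loss

B, N = 2, 292
targets = (torch.rand(B, N) > 0.97).float()              # toy multi-behavior ground truth R
preds = [torch.randn(B, N, requires_grad=True) for _ in range(4)]
loss = joint_loss(*preds, targets)
loss.backward()                                          # gradients reach every branch, as in the joint training
print(float(loss))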
8. The multi-modal joint learning method for video multi-behavior recognition according to claim 7, wherein the final behavior prediction generated by the multi-modal joint learning network is expressed as:
\hat{Y}^{v,\alpha,\tau} = G_v(\tilde{X}) + G_\alpha(X_\alpha)\,X + G_\tau(X_\tau)\,X
wherein X denotes the dynamic spatio-temporal features output by the visual feature extraction network; \tilde{X} denotes the broadcast of X in the feature dimension; X_\alpha denotes the static audio behavior feature dictionary; X_\tau denotes the static text behavior feature dictionary; G_v(\tilde{X}) denotes the prediction of the visual modality learning network for the input \tilde{X}; G_\alpha(X_\alpha) denotes the prediction of the audio modality learning network for the input X_\alpha; and G_\tau(X_\tau) denotes the prediction of the text modality learning network for the input X_\tau.
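The fused prediction can be sketched end-to-end as below; the additive fusion, the matrix-product application of the audio/text relations, and all layer stand-ins are assumptions made for illustration only (the combination formula above is itself a reconstruction).

import torch
import torch.nn as nn

B, N, C, D = 2, 292, 1024, 128
g_v, g_alpha, g_tau = nn.Linear(C, N), nn.Linear(D, C), nn.Linear(D, C)   # stand-ins for the three GCNs

x = torch.randn(B, C)                           # dynamic spatio-temporal feature X from H
x_tilde = x.unsqueeze(1).expand(-1, N, -1)      # broadcast of X as node features
x_a, x_t = torch.randn(N, D), torch.randn(N, D) # static audio / text behavior feature dictionaries

y_v = g_v(x_tilde).mean(dim=1)                  # visual branch prediction
y_va = x @ g_alpha(x_a).t()                     # audio relation applied to X (assumed matrix product)
y_vt = x @ g_tau(x_t).t()                       # text relation applied to X (assumed matrix product)
y_vat = y_v + y_va + y_vt                       # assumed additive fusion of the three terms
print(y_vat.shape)                              # torch.Size([2, 292])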
9. The multi-modal joint learning method for video multi-behavior recognition according to claim 8, wherein: the audio behavior feature dictionary and the text behavior feature dictionary are each defined as a set L of pairs (f, s), where the form f is an embedded feature of finite dimension and the meaning s is the corresponding behavior in a given behavior set; a feature corresponding to multiple behaviors is polysemous, and multiple features belonging to one behavior are synonyms; the audio and text feature dictionaries are denoted as the sets L_α and L_τ, respectively, wherein the audio and text embedded features f_α and f_τ are the corresponding forms and the behaviors s are the meanings;
the behavior node features of the audio modality learning network and the text modality learning network are initialized by querying the corresponding dictionary; the node features are modeled by traversing all meanings and querying the forms of their synonyms from the dictionary, so that the audio modality learning network and the text modality learning network can infer the semantic relations among all the behaviors modeled as node features.
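A sketch of initializing behavior node features from a (form, meaning) dictionary by averaging the forms of synonyms that share a meaning; the averaging rule, the toy embeddings, and all names are assumptions of this illustration.

import torch

def init_node_features(dictionary, behaviors, dim):
    # dictionary: list of (form, meaning) pairs; form is a 1-D embedding, meaning is a behavior name.
    # Each behavior node is the mean of the forms of all its synonyms (features sharing that meaning).
    nodes = []
    for b in behaviors:
        forms = [f for f, s in dictionary if s == b]
        nodes.append(torch.stack(forms).mean(dim=0) if forms else torch.zeros(dim))
    return torch.stack(nodes)                    # (N, dim) node features for the modality GCN

dim = 16
behaviors = ["open", "close", "lock"]
dictionary = [(torch.randn(dim), "open"),        # two forms with the same meaning act as synonyms
              (torch.randn(dim), "open"),
              (torch.randn(dim), "close")]
print(init_node_features(dictionary, behaviors, dim).shape)   # torch.Size([3, 16])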
10. The multi-modal joint learning method for video multi-behavior recognition according to any one of claims 5-9, wherein: the visual modality learning network, the audio modality learning network and the text modality learning network each adopt a relational graph convolutional neural network with a two-layer structure.
CN202111143894.6A 2021-09-28 2021-09-28 Multi-mode joint learning method for video multi-behavior recognition Active CN113807307B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111143894.6A CN113807307B (en) 2021-09-28 2021-09-28 Multi-mode joint learning method for video multi-behavior recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111143894.6A CN113807307B (en) 2021-09-28 2021-09-28 Multi-mode joint learning method for video multi-behavior recognition

Publications (2)

Publication Number Publication Date
CN113807307A true CN113807307A (en) 2021-12-17
CN113807307B CN113807307B (en) 2023-12-12

Family

ID=78938904

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111143894.6A Active CN113807307B (en) 2021-09-28 2021-09-28 Multi-mode joint learning method for video multi-behavior recognition

Country Status (1)

Country Link
CN (1) CN113807307B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114299551A (en) * 2022-03-07 2022-04-08 深圳市海清视讯科技有限公司 Model training method, animal behavior identification method, device and equipment
CN117690098A (en) * 2024-02-01 2024-03-12 南京信息工程大学 Multi-label identification method based on dynamic graph convolution under open driving scene

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111246256A (en) * 2020-02-21 2020-06-05 华南理工大学 Video recommendation method based on multi-mode video content and multi-task learning
US20200356858A1 (en) * 2019-05-10 2020-11-12 Royal Bank Of Canada System and method for machine learning architecture with privacy-preserving node embeddings
CN113051927A (en) * 2021-03-11 2021-06-29 天津大学 Social network emergency detection method based on multi-modal graph convolutional neural network

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200356858A1 (en) * 2019-05-10 2020-11-12 Royal Bank Of Canada System and method for machine learning architecture with privacy-preserving node embeddings
CN111246256A (en) * 2020-02-21 2020-06-05 华南理工大学 Video recommendation method based on multi-mode video content and multi-task learning
CN113051927A (en) * 2021-03-11 2021-06-29 天津大学 Social network emergency detection method based on multi-modal graph convolutional neural network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
QI, JINSHAN; LIANG, XUN; LI, ZHIYU; CHEN, YANFANG; XU, YUAN: "Representation Learning of Large-Scale Complex Information Networks: Concepts, Methods and Challenges", 计算机学报 (Chinese Journal of Computers), no. 10, pages 222-248 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114299551A (en) * 2022-03-07 2022-04-08 深圳市海清视讯科技有限公司 Model training method, animal behavior identification method, device and equipment
CN117690098A (en) * 2024-02-01 2024-03-12 南京信息工程大学 Multi-label identification method based on dynamic graph convolution under open driving scene
CN117690098B (en) * 2024-02-01 2024-04-30 南京信息工程大学 Multi-label identification method based on dynamic graph convolution under open driving scene

Also Published As

Publication number Publication date
CN113807307B (en) 2023-12-12

Similar Documents

Publication Publication Date Title
CN108319686B (en) Antagonism cross-media retrieval method based on limited text space
CN107679580B (en) Heterogeneous migration image emotion polarity analysis method based on multi-mode depth potential correlation
Tang et al. Graph-based multimodal sequential embedding for sign language translation
CN112733533A (en) Multi-mode named entity recognition method based on BERT model and text-image relation propagation
CN113297370B (en) End-to-end multi-modal question-answering method and system based on multi-interaction attention
Zhang et al. Semantic sentence embeddings for paraphrasing and text summarization
CN113807307B (en) Multi-mode joint learning method for video multi-behavior recognition
Barsever et al. Building a better lie detector with BERT: The difference between truth and lies
Ohishi et al. Trilingual semantic embeddings of visually grounded speech with self-attention mechanisms
CN114528411A (en) Automatic construction method, device and medium for Chinese medicine knowledge graph
CN113076483A (en) Case element heteromorphic graph-based public opinion news extraction type summarization method
CN112733764A (en) Method for recognizing video emotion information based on multiple modes
CN116561305A (en) False news detection method based on multiple modes and transformers
CN114969458A (en) Hierarchical self-adaptive fusion multi-modal emotion analysis method based on text guidance
CN115129934A (en) Multi-mode video understanding method
Huang et al. An effective multimodal representation and fusion method for multimodal intent recognition
CN114022687A (en) Image description countermeasure generation method based on reinforcement learning
CN113627550A (en) Image-text emotion analysis method based on multi-mode fusion
CN116933051A (en) Multi-mode emotion recognition method and system for modal missing scene
CN116977701A (en) Video classification model training method, video classification method and device
Pandey et al. Attention-based Model for Multi-modal sentiment recognition using Text-Image Pairs
CN113806545B (en) Comment text emotion classification method based on label description generation
CN115169472A (en) Music matching method and device for multimedia data and computer equipment
CN114693949A (en) Multi-modal evaluation object extraction method based on regional perception alignment network
Liu et al. Attention-based convolutional LSTM for describing video

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant