
Multi-modal joint learning method for video multi-behavior recognition

Info

Publication number
CN113807307A
CN113807307A (application CN202111143894.6A)
Authority
CN
China
Prior art keywords
behavior
modal
audio
visual
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111143894.6A
Other languages
Chinese (zh)
Other versions
CN113807307B (en)
Inventor
石珍生
郑海永
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ocean University of China
Original Assignee
Ocean University of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ocean University of China filed Critical Ocean University of China
Priority to CN202111143894.6A priority Critical patent/CN113807307B/en
Publication of CN113807307A publication Critical patent/CN113807307A/en
Application granted granted Critical
Publication of CN113807307B publication Critical patent/CN113807307B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the technical field of multi-behavior video recognition, and specifically discloses a multi-modal joint learning method for video multi-behavior recognition, comprising the following steps: S1, constructing a multi-modal joint learning network, wherein the multi-modal joint learning network comprises a visual modality learning module, an audio modality learning network and a text modality learning network; S2, preprocessing an original video dataset to obtain a corresponding visual frame dataset, an audio behavior feature dictionary and a text behavior feature dictionary; and S3, inputting the visual frame dataset into the visual modality learning module, the audio behavior feature dictionary into the audio modality learning network, and the text behavior feature dictionary into the text modality learning network for joint training, so as to output a multi-modal joint behavior prediction combining the three modalities of vision, audio and text. Ablation studies, multi-behavior relationship visualization and improvement analysis demonstrate the effectiveness of the multi-modal multi-behavior relationship modeling, and state-of-the-art performance is achieved on the large-scale multi-behavior benchmark dataset M-MiT.

Description

Multi-modal joint learning method for video multi-behavior recognition
Technical Field
The invention relates to the technical field of multi-behavior video recognition, and in particular to a multi-modal joint learning method for video multi-behavior recognition.
Background
Multi-behavior video recognition is more challenging because it requires identifying multiple behaviors that occur simultaneously or consecutively. Modeling multi-behavior relationships is beneficial and crucial for understanding videos with multiple behaviors, and the behaviors in a video are typically presented in the form of multiple modalities.
Video understanding is a very complex and comprehensive task in computer vision, as it aims to identify the activities occurring in complex environments from complex audiovisual videos. The activities described in a video are typically composed of several behaviors that may occur simultaneously or sequentially. For example, when a "show" behavior occurs, it is often accompanied by "applause" and "cheering" behaviors. Multi-behavior video recognition is the task of automatically recognizing all behaviors occurring simultaneously in a video. Although considerable progress has been made in behavior recognition, multi-behavior recognition still has considerable limitations. Beyond the single-behavior video recognition task, more and more work is exploring the relationships between behaviors and objects in videos. Therefore, in order to identify all behaviors occurring simultaneously in a video and thus better address the multi-behavior recognition problem, it is beneficial and crucial to explore the relationships among multiple behaviors, i.e., multi-behavior relationships.
Recent advances in multi-behavior video recognition have focused on training classifiers on hand-designed and extracted spatio-temporal features, or on designing three-dimensional convolutional neural network (3D-CNN) structures to learn high-resolution spatio-temporal representations for classification. However, previous studies have not specifically considered the relationships among multiple behaviors in a video. Furthermore, although multi-modal information has been used to analyze multi-behavior videos, it has only been used to extract features of the respective modalities (i.e., spatio-temporal and acoustic features of the visual and audio modalities) for fusion classification, rather than to explore multi-modal multi-behavior relationships for obtaining more discriminative representations. Therefore, how to fully utilize multi-modal information to better explore multi-behavior relationships is the key to multi-behavior video recognition.
Disclosure of Invention
The invention provides a multi-modal joint learning method for video multi-behavior recognition, which solves the technical problem of how to fully utilize multi-modal information for multi-behavior video recognition.
In order to solve this technical problem, the invention provides a multi-modal joint learning method for video multi-behavior recognition, comprising the following steps:
S1, constructing a multi-modal joint learning network, wherein the multi-modal joint learning network comprises a visual modality learning module, an audio modality learning network and a text modality learning network;
S2, preprocessing an original video dataset to obtain a corresponding visual frame dataset, an audio behavior feature dictionary and a text behavior feature dictionary;
and S3, inputting the visual frame dataset into the visual modality learning module, the audio behavior feature dictionary into the audio modality learning network, and the text behavior feature dictionary into the text modality learning network for joint training, so as to output a multi-modal joint behavior prediction combining the three modalities of vision, audio and text.
Further, the visual modality learning module comprises a visual feature extraction network and a visual modality learning network; in step S3, the learning process of the visual modality learning module specifically comprises the steps of:
S31, the visual feature extraction network performs feature extraction on the input visual frame dataset, generates spatio-temporal features and broadcasts them to the visual modality learning network as the node features of N behaviors;
and S32, the visual modality learning network enhances the node features of the N behaviors, then averages them over the behavior dimension, and outputs the visual modality behavior prediction.
Further, in step S3, the learning process of the audio modality learning network specifically comprises the steps of:
S33, the audio modality learning network extracts audio modality multi-behavior relationships from the input audio behavior feature dictionary;
and S34, the audio modality multi-behavior relationships are applied to the spatio-temporal features generated by the visual feature extraction network to output the audio modality assisted joint behavior prediction.
Further, in step S3, the learning process of the text modality learning network specifically comprises the steps of:
S35, the text modality learning network extracts text modality multi-behavior relationships from the input text behavior feature dictionary;
and S36, the text modality multi-behavior relationships are applied to the spatio-temporal features generated by the visual feature extraction network to output the text modality assisted joint behavior prediction.
Further, the visual modality learning network, the audio modality learning network and the text modality learning network all employ a relation graph convolutional neural network, which is expressed as:

$$H^{(l+1)}_{\zeta} = \sigma\!\left(\tilde{D}^{-\frac{1}{2}}\,\tilde{A}\,\tilde{D}^{-\frac{1}{2}}\,H^{(l)}_{\zeta}\,W^{(l)}\right)$$

where $\tilde{A} = A + I_N$ is the adjacency matrix of the multi-behavior undirected graph $\mathcal{G}$ with added self-connections, $I_N$ is the identity matrix, $\tilde{D}$ is the diagonal degree matrix of $\tilde{A}$, $\sigma(\cdot)$ denotes a non-linear activation function, $W^{(l)}$ is the trainable weight matrix of layer $l$, and $H^{(l)}_{\zeta}$ denotes the multi-behavior relationships of the $l$-th layer; $\zeta$ denotes the modality, where $\zeta = v$ denotes the visual modality, $\zeta = \alpha$ the audio modality, and $\zeta = \tau$ the text modality. The multi-behavior undirected graph $\mathcal{G}$ is defined as $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ is the set of nodes representing behaviors and $\mathcal{E}$ is the set of edges of co-occurring behaviors represented by a binary adjacency matrix $A \in \{0, 1\}^{N \times N}$.
Further, the conditional probability $\psi_{ij} = \psi(v_j \mid v_i)$ denotes the probability that behavior $v_j$ occurs when behavior $v_i$ occurs; $\psi_{ij}$ is computed from the number of occurrences of the behavior pair $\{v_j, v_i\}$ and the number of occurrences of behavior $v_i$ in the training set, and a threshold $t$ is then applied to binarize $\psi_{ij}$ as the initialization, i.e. $A_{ij} = 1$ if $\psi_{ij} > t$ and $A_{ij} = 0$ otherwise, thereby introducing the behavior co-occurrence probabilities as the binary adjacency matrix A.
Further, the model error for jointly training the multi-modal joint learning network is expressed as:

$$\mathcal{E} = \ell\left(Z,\ R\right)$$

where R denotes the actual observations, H denotes the visual feature extraction network, $G_v$, $G_\alpha$ and $G_\tau$ denote the visual modality learning network, the audio modality learning network and the text modality learning network respectively, $Z_v$ denotes the visual modality behavior prediction obtained by the visual feature extraction network in conjunction with the visual modality learning network, $Z_\alpha$ denotes the audio modality assisted joint behavior prediction obtained by the visual feature extraction network in conjunction with the audio modality learning network, $Z_\tau$ denotes the text modality assisted joint behavior prediction obtained by the visual feature extraction network in conjunction with the text modality learning network, Z denotes the multi-modal joint behavior prediction of the multi-modal joint learning network, and $\ell(\cdot,\cdot)$ denotes the loss function.
In the joint training process, the modality-specific relationship representations first receive the error gradients to update the weights of the three relation graph convolutional neural networks so as to minimize the loss, and the error is then propagated from the three relation graph convolutional neural networks to the visual feature extraction network through the shared spatio-temporal representation to adjust its weights accordingly, so that the multi-modal joint learning network can be trained in a joint learning manner across multiple modalities, the relation graph convolutional neural networks are forced to learn more accurate relationship predictions from the spatio-temporal features, and the visual feature extraction network is driven to model stronger and more relevant spatio-temporal features from the video.
Further, the final behavior prediction generated by the multi-modal joint learning network is expressed as:

$$Z = \overline{G_v(\tilde{X})} + X\,G_\alpha(X_\alpha) + X\,G_\tau(X_\tau)$$

where X denotes the dynamic spatio-temporal features output by the visual feature extraction network, $\tilde{X}$ denotes X broadcast along the feature dimension, $X_\alpha$ denotes the static audio behavior feature dictionary, $X_\tau$ denotes the static text behavior feature dictionary, $\overline{G_v(\tilde{X})}$ denotes the prediction of the visual modality learning network for the input $\tilde{X}$, averaged over the behavior dimension, $G_\alpha(X_\alpha)$ denotes the prediction of the audio modality learning network for the input $X_\alpha$, and $G_\tau(X_\tau)$ denotes the prediction of the text modality learning network for the input $X_\tau$.
Further, the audio behavior feature dictionary and the text behavior feature dictionary are each defined as a set L of pairs (f, s), where the form f is an embedded feature of finite dimension and the meaning s is the corresponding behavior in a given behavior set; a feature corresponding to multiple behaviors is called polysemous, and features belonging to one behavior are called synonyms; the audio and text feature dictionaries are denoted as sets $L_\alpha$ and $L_\tau$ respectively, where the audio and text embedding features $f_\alpha$ and $f_\tau$ are the corresponding forms and the behaviors s are the meanings.
The behavior features of the audio modality learning network and the text modality learning network are initialized by querying the corresponding dictionary: the node features are modeled by traversing all meanings and querying the forms of the synonyms from the dictionary, so that the audio modality learning network and the text modality learning network can infer the semantic relationships among all the behaviors modeled as node features.
Further, the visual modal learning network, the audio modal learning network and the text modal learning network all adopt a relational graph convolutional neural network with a two-layer structure.
The invention provides a multi-modal joint learning method for video multi-behavior recognition, in which visual, audio and text multi-modal GCNs are constructed based on the relation graph convolutional neural network (GCN), spatio-temporal features are learned by a visual feature extraction network (a 3D convolutional neural network, 3D-CNN), modality-specific behavior representations are input into the multi-modal GCNs as node features to explore modality-aware multi-behavior relationships, and the audio and text embeddings are queried from their respective feature dictionaries. Ablation studies, multi-behavior relationship visualization and improvement analysis all show the effectiveness of the multi-modal multi-behavior relationship modeling. In addition, the method achieves state-of-the-art performance on the large-scale multi-behavior benchmark dataset M-MiT.
Drawings
FIG. 1 is a block diagram of a multimodal joint learning network provided by an embodiment of the present invention;
FIG. 2 is a diagram of an example of a multi-behavior Grad-CAM visualization with concurrent actions provided by an embodiment of the present invention;
FIG. 3 is a diagram of an exemplary representation of feature variations and behavior prediction scores for multiple behavior relationships for layers of a GCN according to an embodiment of the present invention;
fig. 4 is a diagram illustrating effect enhancement of multi-modal multi-behavior GCNs and visual GCNs in different behavior categories according to an embodiment of the present invention.
Detailed Description
The embodiments of the present invention will be described in detail below with reference to the accompanying drawings, which are given solely for the purpose of illustration and are not to be construed as limitations of the invention, since many variations thereof are possible without departing from the spirit and scope of the invention.
Multimedia data is typically a carrier of multiple kinds of information; in video, for example, visual, auditory and textual information are often conveyed simultaneously. Therefore, multi-modal learning is gradually developing into a main approach to multimedia content analysis and understanding, and among the modalities, the visual modality is widely used for its rich characterization capability. In addition, multi-modal joint representations are generally considered to have stronger representation capability. Unlike existing video multi-modal learning, this example provides a new multi-modal joint learning method which, in accordance with observations of the real world, accurately identifies all behaviors in a video and explores a multi-behavior relationship model in the video.
Recently, GCN (relational graph convolutional neural network) has also been used to explore relationships in video due to its powerful relational modeling capability. But this example does not just discover relationships from video frames, but sets behaviors as graph network nodes to build a multi-modal multi-behavior GCN to explore multi-behavior relationships of particular modalities in the video. The example mainly designs a multi-modal joint learning network for multi-behavior video recognition according to the following three observations:
(1) visual frames are far more important than other modalities to humans' daily experience and way of understanding the world (more than 80% of the information transmitted to the brain is visual);
(2) sounds are determined by the attributes of behaviors and are informative, and humans can construct a sound-to-behavior mapping from the experience in their brains;
(3) the human brain can also associate behaviors with their language labels (meaning words) to create a text-to-behavior mapping.
In practice, the behaviors in a video first appear as visual spatial and temporal frames, they are strongly correlated with the synchronously recorded audio, and finally they are correlated with each other literally (through the label text). Therefore, leveraging this multi-modal information in the video (i.e., frames, audio and text) to explore multi-behavior relationships can greatly help in identifying multiple behaviors and understanding complex videos.
Based on this, this example provides a multimodal joint learning method for video multi-behavior recognition, specifically including the steps of:
S1, constructing the multi-modal joint learning network shown in Fig. 1, wherein the multi-modal joint learning network comprises a visual modality learning module, an audio modality learning network (α) and a text modality learning network (τ);
S2, preprocessing the original video dataset to obtain the corresponding visual frame dataset, audio behavior feature dictionary and text behavior feature dictionary;
S3, inputting the visual frame dataset into the visual modality learning module, the audio behavior feature dictionary into the audio modality learning network, and the text behavior feature dictionary into the text modality learning network for joint training, so as to output the visual modality behavior prediction $Z_v$, the audio modality assisted joint behavior prediction $Z_\alpha$, the text modality assisted joint behavior prediction $Z_\tau$, and the joint multi-modal behavior prediction Z.
Here, the order of steps S1 and S2 is not limited.
Specifically, the visual modality learning module comprises a visual feature extraction network (mainly comprising 3D-CNN) and a visual modality learning network (v). In this embodiment, the visual modality learning network (v), the audio modality learning network (α), and the text modality learning network (τ) all employ GCNs, referred to as visual GCNs, audio GCNs, and text GCNs, respectively, which are collectively referred to as GCNs in this example.
In step S3, the learning process of the visual modality learning module specifically includes the steps of:
S31, the visual feature extraction network performs feature extraction (i.e., spatio-temporal representation) on the input visual frame dataset and generates spatio-temporal features X, which are broadcast to the visual modality learning network as the node features of N behaviors;
S32, the visual modality learning network enhances the node features of the N behaviors, then averages them over the behavior dimension and outputs the visual modality behavior prediction $Z_v$.
The visual modality has a strong ability to characterize the behaviors in a video, and 3D-CNNs show powerful performance in parsing and representing the visual modality. This example therefore models the visual behavior features using 3D-CNN spatio-temporal features. In the visual modality, behaviors are streamed dynamically across multiple frames, and they are diverse and varied. In essence, the 3D-CNN learns to parse behaviors through continuous input frames and dynamic optimization of the spatio-temporal features, making them more discriminative and finally generating a strong visual behavior representation. These visual features imply the relationships among multiple behaviors and are therefore suitable as the behavior features of the visual GCN for further exploring relation-enhanced multi-behavior representations in the visual modality.
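As a minimal, non-authoritative sketch of the visual branch described above, the following PyTorch-style code broadcasts the C-dimensional spatio-temporal features of a 3D-CNN to N behavior nodes, enhances them with a visual GCN, and averages over the behavior dimension; the class and argument names (VisualBranch, backbone_3dcnn, visual_gcn) are illustrative assumptions rather than the patent's actual implementation.

```python
import torch
import torch.nn as nn

class VisualBranch(nn.Module):
    """Sketch: 3D-CNN spatio-temporal features -> visual GCN -> Z_v."""
    def __init__(self, backbone_3dcnn: nn.Module, visual_gcn: nn.Module, num_behaviors: int):
        super().__init__()
        self.backbone = backbone_3dcnn      # H: video clip -> features X of shape (B, C)
        self.gcn = visual_gcn               # G_v: (B, N, C) -> (B, N, C)
        self.num_behaviors = num_behaviors  # N

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        X = self.backbone(clips)                                      # (B, C) spatio-temporal features
        X_tilde = X.unsqueeze(1).expand(-1, self.num_behaviors, -1)   # broadcast to (B, N, C) node features
        enhanced = self.gcn(X_tilde)                                  # relation-enhanced node features
        Z_v = enhanced.mean(dim=1)                                    # average over the behavior dimension
        return Z_v                                                    # visual modality behavior prediction
```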
In step S3, the learning process of the audio modality learning network specifically includes the steps of:
S33, the audio modality learning network extracts audio modality multi-behavior relationships from the input audio behavior feature dictionary;
and S34, the audio modality multi-behavior relationships are applied to the spatio-temporal features generated by the visual feature extraction network to output the audio modality assisted joint behavior prediction.
In step S3, the learning process of the text modality learning network specifically includes the steps of:
S35, the text modality learning network extracts text modality multi-behavior relationships from the input text behavior feature dictionary;
and S36, the text modality multi-behavior relationships are applied to the spatio-temporal features generated by the visual feature extraction network to output the text modality assisted joint behavior prediction.
Here, the three networks of the visual modality learning network, the audio modality learning network, and the text modality learning network learn synchronously.
Because of their more naive characterization capabilities, the audio and text modalities are often used as an aid to the visual modality for identifying behaviors in video, but they still implicitly contain audio-behavior and text-behavior relationships. Thus, this example further enhances the recognized spatio-temporal features by exploiting the audio and text modalities, modeling their modality-specific behavior features for the audio GCN and the text GCN respectively so as to aggregate modality-specific multi-behavior relationships. For a multi-behavior video dataset, audio and behaviors form a many-to-many mapping, i.e., one audio clip may correspond to multiple behaviors and one behavior may correspond to multiple audio clips, while text labels and behaviors form a one-to-one mapping, i.e., one label carries the meaning of one behavior. Therefore, this example represents these two modalities by defining a many-to-many audio behavior feature dictionary and a one-to-one text behavior feature dictionary for the behavior features of the audio GCN and the text GCN respectively. This example uses the VGGish model and the GloVe model to represent all audio and text labels of the video dataset and builds the audio behavior feature dictionary and the text behavior feature dictionary in the form of audio and word embeddings, respectively.
The audio behavior feature dictionary and the text behavior feature dictionary are each defined as a set L of pairs (f, s), where the form f is an embedded feature of finite dimension and the meaning s is the corresponding behavior in a given behavior set; a feature corresponding to multiple behaviors is called polysemous, and features belonging to one behavior are called synonyms. The audio and text feature dictionaries are denoted as sets $L_\alpha$ and $L_\tau$ respectively, where the audio and text embedding features $f_\alpha$ and $f_\tau$ are the corresponding forms and the behaviors s are the meanings.
The behavior features of the audio modality learning network and the text modality learning network are initialized by querying the corresponding dictionaries: the node features are modeled by traversing all meanings and querying the forms of the synonyms from the dictionaries, so that the audio modality learning network and the text modality learning network can infer the semantic relationships among all the behaviors modeled as node features.
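The following is a hedged sketch of how such dictionary-based node features might be initialized; the helper name, the averaging of synonym features, and the assumption that each stored feature is a 1-D vector are all illustrative choices, not the patent's prescribed procedure.

```python
import numpy as np

def build_node_features(feature_dict: dict, behaviors: list, num_synonyms: int = 1) -> np.ndarray:
    """Sketch: initialize GCN node features by querying a behavior feature dictionary.

    feature_dict maps each behavior name (meaning s) to a list of embedded 1-D
    features (forms f), e.g. pooled VGGish audio embeddings or a GloVe vector.
    """
    node_features = []
    for behavior in behaviors:                                # traverse all meanings
        synonyms = feature_dict[behavior][:num_synonyms]      # query synonym forms from the dictionary
        node_features.append(np.mean(synonyms, axis=0))       # one node feature per behavior
    return np.stack(node_features)                            # shape (N, P) for audio or (N, Q) for text
```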
In this example, the visual modality learning network, the audio modality learning network and the text modality learning network all use the relation graph convolutional neural network (GCN), expressed in a multi-layer form with the layer-wise propagation rule:

$$H^{(l+1)}_{\zeta} = \sigma\!\left(\tilde{D}^{-\frac{1}{2}}\,\tilde{A}\,\tilde{D}^{-\frac{1}{2}}\,H^{(l)}_{\zeta}\,W^{(l)}\right) \qquad (1)$$

where $\tilde{A} = A + I_N$ is the adjacency matrix of the multi-behavior undirected graph $\mathcal{G}$ with added self-connections, $I_N$ is the identity matrix, $\tilde{D}$ is the diagonal degree matrix of $\tilde{A}$, $\sigma(\cdot)$ denotes a non-linear activation function, $W^{(l)}$ is the trainable weight matrix of layer $l$, and $H^{(l)}_{\zeta}$ denotes the multi-behavior relationships of the $l$-th layer; $\zeta$ denotes the modality, where $\zeta = v$ denotes the visual modality, $\zeta = \alpha$ the audio modality, and $\zeta = \tau$ the text modality. The multi-behavior undirected graph $\mathcal{G}$ is defined as $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ is the set of nodes representing behaviors and $\mathcal{E}$ is the set of edges of co-occurring behaviors represented by a binary adjacency matrix $A \in \{0, 1\}^{N \times N}$.
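A minimal PyTorch sketch of the two-layer relation GCN of equation (1) is given below; the choice of LeakyReLU as the non-linear activation and the class name RelationGCN are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class RelationGCN(nn.Module):
    """Sketch of the two-layer relation GCN in Eq. (1):
    H^(l+1) = sigma(D~^-1/2 A~ D~^-1/2 H^(l) W^(l))."""
    def __init__(self, adj: torch.Tensor, in_dim: int, hidden_dim: int, out_dim: int):
        super().__init__()
        A_tilde = adj + torch.eye(adj.size(0))                    # add self-connections I_N
        deg = A_tilde.sum(dim=1)
        D_inv_sqrt = torch.diag(deg.pow(-0.5))                    # D~^-1/2
        self.register_buffer("A_hat", D_inv_sqrt @ A_tilde @ D_inv_sqrt)  # normalized adjacency
        self.W0 = nn.Linear(in_dim, hidden_dim, bias=False)       # layer-0 trainable weight
        self.W1 = nn.Linear(hidden_dim, out_dim, bias=False)      # layer-1 trainable weight
        self.act = nn.LeakyReLU(0.2)                              # non-linear activation sigma (assumed)

    def forward(self, H0: torch.Tensor) -> torch.Tensor:
        H1 = self.act(self.A_hat @ self.W0(H0))                   # first propagation layer
        H2 = self.A_hat @ self.W1(H1)                             # second propagation layer
        return H2                                                 # relation-enhanced node features
```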
For the binary adjacency matrix A, this example uses the conditional probability $\psi_{ij} = \psi(v_j \mid v_i)$ to denote the probability that behavior $v_j$ occurs when behavior $v_i$ occurs; $\psi_{ij}$ is computed from the number of occurrences of the behavior pair $\{v_j, v_i\}$ and the number of occurrences of behavior $v_i$ in the training set, and a threshold t is then applied to binarize $\psi_{ij}$ as the initialization, i.e. $A_{ij} = 1$ if $\psi_{ij} > t$ and $A_{ij} = 0$ otherwise, thereby introducing the behavior co-occurrence probabilities as the binary adjacency matrix A.
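The construction of A from the training annotations could look like the following sketch, assuming multi-hot label vectors per video; the function name and the zeroing of the diagonal (self-connections are added later as $I_N$) are illustrative assumptions.

```python
import numpy as np

def build_adjacency(labels: np.ndarray, t: float = 0.4) -> np.ndarray:
    """Sketch: binary co-occurrence adjacency from multi-hot labels.

    labels: (num_videos, N) matrix with labels[k, i] = 1 if behavior i is
    annotated in video k.  psi_ij = P(v_j | v_i) is estimated from counts.
    """
    counts_i = labels.sum(axis=0)                        # occurrences of each behavior v_i
    counts_ij = labels.T @ labels                        # co-occurrences of pairs (v_i, v_j)
    psi = counts_ij / np.maximum(counts_i[:, None], 1)   # conditional probability psi_ij
    A = (psi > t).astype(np.float32)                     # binarize with threshold t
    np.fill_diagonal(A, 0.0)                             # self-connections are added later as I_N
    return A
```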
The multi-behavior GCN structure constructed by the embodiment can explore the relationship among multiple behaviors. In essence, multi-behavior GCNs affect each behavior by aggregating features of neighboring behaviors, thereby learning a new representation of the relationship of one behavior to other behaviors. In this way, multiple behavior relationships are progressively aggregated and propagated to multiple GCN layers based on input node characteristics. In fact, multiple behaviors in the video exist in a multi-modal manner, and therefore, in order to better explore the relationship among the multiple behaviors, it is beneficial and crucial to construct a multi-modal GCN to utilize different node features of the multi-modal.
Behaviors in video have representations in various modalities, i.e., visual, audio and text, which play different roles in representing the behaviors. Thus, this example constructs a multi-modal multi-behavior graph network from a video dataset with three modalities, and in this work a two-layer GCN structure ($l \in \{0, 1\}$ in equation (1)) is simply adopted for each modality, where the three modalities are visual ($\zeta = v$), audio ($\zeta = \alpha$) and text ($\zeta = \tau$). The spatio-temporal representation of the video contains the richest discriminative features for identifying behaviors, so this example uses the 3D-CNN to extract spatio-temporal features and inputs them into the graph nodes for relation-enhanced classification, obtaining the visual GCN. Unlike the visual modality, audio and text in video mainly assist in recognizing behaviors due to their naive characterization capabilities; the spatio-temporal features corresponding to behaviors are typically dynamic and diverse, while audio and text are relatively static. Thus, this example designs an audio behavior feature dictionary and a text behavior feature dictionary for the video dataset and treats them as graph node features for exploring multi-behavior relationships from the audio and text modalities to aid the visual modality, generating the audio GCN and the text GCN respectively.
Formally, for the visual modality, this example broadcasts the spatio-temporal features generated by the 3D-CNN, $X \in \mathbb{R}^{C}$ (C is the behavior dimension), to $\tilde{X} \in \mathbb{R}^{N \times C}$ as the node features of the N behaviors; the relation-enhanced features obtained after aggregation by the visual GCN, $G_v(\tilde{X})$, are then averaged over the behavior dimension to output the visual modality behavior prediction $Z_v$. For the audio modality, this example represents the dictionary audio embeddings $X_\alpha \in \mathbb{R}^{N \times P}$ (P is the audio dimension) as the graph behavior features, so that the multi-behavior relationships of the audio modality, $G_\alpha(X_\alpha)$, can be transferred from $X_\alpha$ within the audio GCN; the audio modality relationships are finally applied to the spatio-temporal features X to obtain the audio modality assisted joint behavior prediction $Z_\alpha$. Similarly, for the text modality, this example represents the dictionary text embeddings $X_\tau \in \mathbb{R}^{N \times Q}$ (Q is the text dimension) as the behavior features in the graph, so that the text GCN aggregates the text modality multi-behavior relationships $G_\tau(X_\tau)$ for the further text modality assisted joint behavior prediction $Z_\tau$.
For the overall model learning, this example has three modality-specific GCN models ($G_v$, $G_\alpha$, $G_\tau$) for relational reasoning and one visual-modality 3D-CNN model H for spatio-temporal representation learning, where the 3D-CNN shares its output spatio-temporal features X with the three GCNs for aggregating and propagating the multi-behavior relationships. The final behavior prediction is generated and compared with the actual behavior labels R (the actual observations) to obtain the model error calculated by the loss function, as follows:

$$\mathcal{E} = \ell\left(Z,\ R\right) \qquad (2)$$

where $Z_v$ denotes the visual modality behavior prediction obtained by the visual feature extraction network in conjunction with the visual modality learning network, $Z_\alpha$ denotes the audio modality assisted joint behavior prediction obtained by the visual feature extraction network in conjunction with the audio modality learning network, $Z_\tau$ denotes the text modality assisted joint behavior prediction obtained by the visual feature extraction network in conjunction with the text modality learning network, Z denotes the multi-modal joint behavior prediction of the multi-modal joint learning network, and $\ell(\cdot,\cdot)$ denotes the loss function.
During the joint training process, the modality-specific relationship representations first receive the error gradients to update the weights of the three relation graph convolutional neural networks so as to minimize the loss, and the error is then propagated from the three relation graph convolutional neural networks to the visual feature extraction network through the shared spatio-temporal representation to adjust its weights accordingly. In this way, the multi-modal joint learning network can be trained in a joint learning manner across multiple modalities: the relation graph convolutional neural networks are forced to learn more accurate relationship predictions from the spatio-temporal features, and the visual feature extraction network is driven to model stronger and more relevant spatio-temporal features from the video.
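A minimal training-step sketch illustrating this joint learning is shown below, assuming the model returns the fused prediction Z of equation (3); the binary cross-entropy loss and SGD settings follow the embodiment described later, while the function and variable names are assumptions.

```python
import torch
import torch.nn as nn

def train_step(model, clips, targets, optimizer):
    """One joint-training step: gradients flow into the three GCNs and, via the
    shared spatio-temporal features, back into the 3D-CNN."""
    criterion = nn.BCEWithLogitsLoss()          # multi-label binary cross-entropy
    optimizer.zero_grad()
    Z = model(clips)                            # fused multi-modal prediction, Eq. (3)
    loss = criterion(Z, targets)                # compare with ground-truth labels R
    loss.backward()                             # error propagates to GCNs and the shared 3D-CNN
    optimizer.step()
    return loss.item()

# optimizer covering both the 3D-CNN and the GCN parameters (settings from the embodiment):
# optimizer = torch.optim.SGD(model.parameters(), lr=0.05, momentum=0.9, weight_decay=1e-4)
```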
Since each modality has its own specific information and characterization capability, this example handles the different modalities with different methods. In particular, the dynamic spatio-temporal features X are the most influential for identifying behaviors from video and are therefore treated as the main information stream of model learning, while the static audio behavior feature dictionary and text behavior feature dictionary ($X_\alpha$ and $X_\tau$) usually assist behavior recognition and are therefore treated as auxiliary streams. As video frames are dynamically loaded into the 3D-CNN, the spatio-temporal representation is gradually learned, while the audio and text embeddings queried from the corresponding fixed dictionaries are simultaneously input into the modality-specific GCNs to serve as the auxiliary streams. Furthermore, this example combines the spatio-temporal representation with the audio and text multi-behavior relationships for the respective behavior predictions, and all three modality-specific behavior predictions are finally fused to produce the final behavior prediction Z, as follows:

$$Z = \overline{G_v(\tilde{X})} + X\,G_\alpha(X_\alpha) + X\,G_\tau(X_\tau) \qquad (3)$$

where $\tilde{X}$ denotes X broadcast along the feature dimension, $X_\alpha$ denotes the static audio behavior feature dictionary, $X_\tau$ denotes the static text behavior feature dictionary, $\overline{G_v(\tilde{X})}$ denotes the prediction of the visual modality learning network for the input $\tilde{X}$, averaged over the behavior dimension, $G_\alpha(X_\alpha)$ denotes the prediction of the audio modality learning network for the input $X_\alpha$, and $G_\tau(X_\tau)$ denotes the prediction of the text modality learning network for the input $X_\tau$. In this way, the information of the three modalities is combined to learn a better relational representation for identifying multiple behaviors.
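The fusion of equation (3) could be sketched as follows; since the text does not fully specify how the audio and text relation outputs are "applied to" X, a matrix product is assumed here, together with the assumption that the behavior dimension C equals the number of behaviors N.

```python
import torch

def fuse_predictions(X, G_v_out, G_alpha_out, G_tau_out):
    """Sketch of Eq. (3), assuming C == N (behavior/class dimension).

    X:           (C,)   spatio-temporal features from the 3D-CNN
    G_v_out:     (N, C) relation-enhanced visual node features G_v(X~)
    G_alpha_out: (N, C) audio multi-behavior relations G_alpha(X_alpha)
    G_tau_out:   (N, C) text multi-behavior relations G_tau(X_tau)
    """
    Z_v = G_v_out.mean(dim=0)                 # average over the behavior dimension
    Z_alpha = G_alpha_out @ X                 # apply audio relations to the spatio-temporal features
    Z_tau = G_tau_out @ X                     # apply text relations to the spatio-temporal features
    return Z_v + Z_alpha + Z_tau              # final multi-modal joint prediction Z
```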
In order to solve the challenging multi-behavior video recognition problem, this example proposes multi-modal GCNs that explore modality-specific multi-behavior relationships by exploiting the powerful relational representation capability of graph networks and the rich multi-modal information in video. Specifically, this example constructs a multi-behavior graph network with the behaviors as nodes and the co-occurrence probabilities of the behaviors as the adjacency matrix, and then constructs multi-modal GCNs for exploring modality-aware multi-behavior relationships, using modality-specific behavior representations as node features, i.e., the spatio-temporal features learned by the 3D-CNN and the audio and text embeddings queried from the respective feature dictionaries. Finally, the audio and text relationships are applied to the spatio-temporal features to generate the respective relational behavior predictions, which are further combined with the visual relational behavior prediction to produce the final prediction.
Experimental verification is performed below.
This example is mainly based on the recently released Multi-Moments in Time (M-MiT) dataset, which is regarded as a large-scale multi-behavior dataset for video understanding. M-MiT V1 contains 1.02 million three-second videos with a total of 2.01 million labels covering 313 behavior classes, annotated from a behavior vocabulary (e.g., skateboarding). In the training set, 553,535 videos are annotated with multiple behaviors, of which 257,491 videos are annotated with three or more behaviors. M-MiT V2 is an updated version of V1 with a revised behavior vocabulary; it includes 1.00 million videos and 292 behavior classes with a total of 1.92 million labels, and its training set includes 525,542 videos with multiple behavior annotations and 243,083 videos with three or more behavior annotations.
The task of multi-behavior video recognition is to recognize all the behaviors occurring in a video. However, in the M-MiT dataset nearly 50% of the videos are annotated with only one behavior. To better explore multi-behavior video recognition, this example builds a new dataset based on M-MiT in which every video is tagged with multiple behaviors while the integrity of the original categories is maintained. To this end, for the training set, this example first deletes videos without audio streams, then randomly selects 300 videos for each category containing over 300 videos and selects all videos for the remaining categories. In this way a "Mini M-MiT" training set is obtained, consisting of 93,206 videos covering 313 behavior categories. Compared with the original M-MiT dataset, the Mini M-MiT dataset accounts for only 10% of the data volume and is more suitable for rapid algorithm development and verification.
IG-65M is a very large pre-training dataset that includes videos generated by over 65 million public users of social media websites. Kinetics-400 is a classic benchmark for behavior recognition, with 246k training and 20k validation videos. This example uses R(2+1)D-34 as the 3D-CNN, taking the published IG-65M pre-trained model and fine-tuning it on Kinetics-400 (top-1 accuracy: 80.5).
The audio behavior feature dictionary is a set of behavior-indexed features consisting of the audio features corresponding to each behavior of the dataset. First, all silent audio in M-MiT is removed to ensure that all audio in the dictionary is valid. The VGGish network is then employed to extract features of size 3 × 128 from the selected audio. Because redundant information exists in the audio data, PCA whitening is further applied to post-process the extracted features. Finally, the audio features are stored by behavior category to obtain the audio behavior feature dictionary.
Similarly, the text behavior feature dictionary is a set of behavior-indexed word features that depends on the behavior vocabulary. This example uses the GloVe network to extract word embeddings for all behaviors in the M-MiT vocabulary, where each behavior corresponds to a feature vector of size 300, creating a text behavior feature dictionary containing the word vectors of all behaviors.
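A hedged sketch of how the two dictionaries might be assembled is given below; extract_vggish_features, apply_pca_whitening, glove_vectors and the video attributes are assumed helpers standing in for the VGGish/PCA and GloVe processing described above.

```python
import numpy as np

def build_audio_dictionary(videos, extract_vggish_features, apply_pca_whitening):
    """Sketch: many-to-many audio behavior feature dictionary {behavior: [embeddings]}."""
    dictionary = {}
    for video in videos:
        if video.has_silent_audio:                              # drop silent audio first
            continue
        feat = apply_pca_whitening(extract_vggish_features(video.audio_path))
        for behavior in video.behaviors:                        # one clip may cover several behaviors
            dictionary.setdefault(behavior, []).append(feat)
    return dictionary

def build_text_dictionary(behavior_vocabulary, glove_vectors):
    """Sketch: one-to-one text behavior feature dictionary using 300-d GloVe vectors."""
    return {behavior: glove_vectors[behavior] for behavior in behavior_vocabulary}
```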
This example performs data augmentation on the temporal and spatial scales: 8 consecutive frames are randomly sampled with a sampling stride of 2. The input frames are cropped by multi-scale random cropping and then resized to 112 × 112. The cropping window size is d × d, where d is the product of the length of the shorter input side and a scaling factor in [0.7, 0.875].
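The augmentation described above could be sketched as follows; PIL-like frame objects and a continuous scaling factor drawn from [0.7, 0.875] are assumptions made for illustration.

```python
import random

def sample_and_crop(frames, num_frames=8, stride=2, scales=(0.7, 0.875), out_size=112):
    """Sketch of the temporal/spatial augmentation: 8 frames with stride 2,
    multi-scale random crop, resize to 112 x 112. frames are PIL-like images."""
    span = (num_frames - 1) * stride + 1
    start = random.randint(0, max(len(frames) - span, 0))          # random temporal position
    clip = frames[start:start + span:stride]                       # 8 frames, stride 2

    short_side = min(clip[0].height, clip[0].width)
    d = int(short_side * random.uniform(*scales))                  # crop window d x d
    y = random.randint(0, clip[0].height - d)
    x = random.randint(0, clip[0].width - d)
    clip = [f.crop((x, y, x + d, y + d)).resize((out_size, out_size)) for f in clip]
    return clip
```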
In this embodiment, the constructed multi-modal joint learning network/model is trained and verified on 8 NVIDIA RTX 2080Ti GPUs; during training the mini-batch size is set to 8 per GPU (64 in total) and batch normalization is applied. For the Mini M-MiT dataset the training lasts 30 epochs in total with an initial learning rate of 0.05, decayed by a factor of 0.1 at epochs 12 and 24, and the first 3 epochs are used for learning-rate warm-up; for the complete M-MiT dataset the initial learning rate is set to 0.01 and no warm-up is needed. The network is trained with a binary cross-entropy loss optimized by SGD with a momentum of 0.9 and a weight decay of 0.0001. t is set to 0.4 to binarize the adjacency matrix A. All experiments are performed with PyTorch 1.3, and this example uses mixed-precision training.
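A sketch of the corresponding optimizer and learning-rate schedule (Mini M-MiT settings) is given below; the linear shape of the warm-up is an assumption, and the scheduler is assumed to be stepped once per epoch.

```python
import torch

def make_optimizer_and_scheduler(model, base_lr=0.05, warmup_epochs=3):
    """Sketch: SGD with momentum/weight decay, linear warm-up, step decay at epochs 12 and 24."""
    optimizer = torch.optim.SGD(model.parameters(), lr=base_lr,
                                momentum=0.9, weight_decay=1e-4)

    def lr_lambda(epoch):
        if epoch < warmup_epochs:                                 # linear warm-up (shape assumed)
            return (epoch + 1) / warmup_epochs
        return 0.1 ** sum(epoch >= m for m in (12, 24))           # decay by 0.1 at epochs 12 and 24

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```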
This example reports the mean average precision (mAP) and the top-1 and top-5 classification accuracies for all experiments, where mAP is regarded as the primary evaluation metric because it captures errors in the ranking of the behaviors related to a video. For each positive label, the proportion of related labels ranked above it is computed, and the results are then averaged over all labels. The top-1 and top-5 accuracies represent the percentage of test videos for which a positive label appears in the top-1 predicted category or in the top-5 predicted categories, respectively.
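A sketch of the per-video label-ranking mAP described above is given below; the exact mAP variant used in the experiments may differ, so this is illustrative only.

```python
import numpy as np

def mean_average_precision(scores: np.ndarray, targets: np.ndarray) -> float:
    """Sketch: for each positive label, the proportion of related labels ranked
    at or above it; averaged per video, then over videos.

    scores:  (num_videos, N) predicted behavior scores
    targets: (num_videos, N) multi-hot ground-truth labels
    """
    aps = []
    for s, t in zip(scores, targets):
        order = np.argsort(-s)                         # rank all labels for this video
        ranked_rel = t[order]                          # relevance of labels in ranked order
        pos_ranks = np.where(ranked_rel == 1)[0]
        if len(pos_ranks) == 0:
            continue
        precisions = np.arange(1, len(pos_ranks) + 1) / (pos_ranks + 1)
        aps.append(precisions.mean())
    return float(np.mean(aps))
```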
This example performs multi-clip testing to evaluate the model comprehensively, sampling temporal clips uniformly from each video and then cropping spatial regions from the frames of these clips. Specifically, 10 temporal clips are uniformly extracted from the entire video and 3 spatial crops (the two sides and the center) are used. Spatially fully-convolutional inference is performed, scaling the shorter side of each video frame to 128 while maintaining the aspect ratio. The final prediction takes the highest score over all clips (for mAP) and the average score (for top-1 and top-5).
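The multi-clip aggregation could be sketched as follows, assuming the 30 crops (10 temporal × 3 spatial) have already been extracted into a single batch; the function name is an illustrative assumption.

```python
import torch

@torch.no_grad()
def multi_clip_inference(model, clips):
    """Sketch: aggregate predictions over 10 temporal clips x 3 spatial crops.

    clips: tensor of shape (30, C, T, H, W) holding the pre-extracted crops."""
    scores = torch.sigmoid(model(clips))       # per-crop behavior probabilities, (30, N)
    score_max = scores.max(dim=0).values       # max over crops, used for mAP
    score_mean = scores.mean(dim=0)            # mean over crops, used for top-1 / top-5
    return score_max, score_mean
```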
In this example, ablation experiments are performed on the constructed Mini M-MiT dataset, using the pre-trained R(2+1)D-34 as the baseline model, to verify the effectiveness of multi-modal multi-behavior relationship modeling; the ablation starts from the baseline 3D-CNN model R(2+1)D. This model uses a fully connected (FC) layer as the classifier, has no GCN structure and involves only the visual modality. This example first uses the visual GCN to replace the FC layer of R(2+1)D, so that the spatio-temporal features are enhanced by exploring visual multi-behavior relationships for the final behavior prediction. Table 1 shows the results of combining different models and involving different modalities, indicating that the visual GCN model of this example outperforms the baseline 3D-CNN model in terms of mAP, top-1 and top-5; it can therefore be seen that the visual GCN of this example does positively impact the performance.
The 3D-CNN is then combined with the corresponding GCN (audio GCN or text GCN), adding an additional modality (audio or text) to the visual modality, resulting in two combined models that generate the audio- and text-assisted behavior predictions respectively; the results are shown in Table 1. It can be observed that by combining the modality-specific GCN with the additional modality, both the top-1 and top-5 accuracies are improved, while the mAP is significantly improved by more than 3%, indicating the effectiveness of the audio and text GCNs of this example in exploring valid multi-behavior relationships. In addition, this example also combines the visual GCN with the audio GCN or the text GCN to obtain the combined models {v, α} or {v, τ}, in which the behavior predictions of the two specific modalities are fused by removing one modality from equation (3); the results in Table 1 also show that they bring additional performance gains.
TABLE 1 ablation study of multimodal joint learning
[Table 1 is presented as an image in the original document.]
Further, the three modalities are combined to obtain a combined model containing the audio and text GCNs but not the visual GCN, and a combined model containing the GCNs of all three modalities. Table 1 shows that the former, which involves all modalities but no visual GCN, obtains results comparable (the same top-1 accuracy and mAP) to those of the two-modality combinations that include the visual GCN. This demonstrates the effectiveness of the visual multi-behavior relationships. Meanwhile, combining the 3D-CNN with the GCNs of all three specific modalities to explore multi-modal multi-behavior relationships obtains the highest mAP, which proves the effectiveness of the multi-modal joint learning of this example. It is worth noting that the multi-modal GCNs of this example provide significant improvements at only a small parameter cost; for example, two of the combined models of this example increase the mAP by 3.2% and 3.4% over the baseline 3D-CNN while introducing only 0.76M and 0.67M additional parameters. In addition, this example also tried different 3D-CNNs (R3D-18 and I3D-50) in the above models and obtained effective results (mAP%): R3D-18 (45.8, 49.1, 49.5, 50.7) and I3D-50 (53.1, 55.6, 55.8, 57.3).
Further study in this example finds that the method significantly improves the mAP by about 3% when using two modalities compared with one modality, mainly because of the introduction of the additional modality and the multi-modal joint learning designed in this example, whereas it provides only a slight performance improvement when using three modalities compared with two. This is believed to be because the naive characterization capability of the auxiliary modalities (audio or text) leaves less additional multi-behavior relationship to be explored under the same characterization mechanism (i.e., GCN and multi-modal joint learning).
For the audio behavior feature dictionary, this example traverses all behaviors to obtain the synonym features of each behavior to initialize the node features of the audio GCN, so this example analyzes how many synonym features should be obtained for one behavior. To this end, ablation experiments were performed with the number of synonym features (#f) set to 1, 2 and 3. The results shown in Table 2(a) indicate that a behavior can be represented by many different audio clips due to their natural many-to-many mapping, but it is preferable to select only one audio feature to represent a behavior for the audio GCN.
TABLE 2 Audio and text dictionary ablation study
[Table 2 is presented as an image in the original document.]
For the text behavior feature dictionary, since behaviors usually have a one-to-one mapping with their text labels (from the behavior vocabulary), this example studies whether different word-embedding methods matter. This example constructs the text feature dictionary using GloVe and BERT respectively, with 300- or 768-dimensional vectors representing each behavior. Table 2 shows that the accuracy of the behavior prediction is almost the same regardless of whether GloVe or BERT is used. Furthermore, compared with the audio combined model, the text combined model achieves similar performance, and the two modalities play a similar role in assisting the recognition of multiple behaviors.
This example also merges the audio and text modalities into one audio-text modality, providing audio-text behavior representations for an audio-text GCN by combining the audio and text dictionaries. The results in Table 2 show the effectiveness of the combined audio-text modality, whose performance is in fact similar to that of the corresponding audio-plus-text combination in Table 1. This example considers that the merged audio-text GCN actually attempts to explore the audio and text multi-behavior relationships simultaneously in one larger model, thereby achieving performance similar to that of the two separate smaller audio and text GCNs.
This example visualizes the attention learned by the 3D-CNN using gradient-weighted class activation mapping (Grad-CAM) to localize the behaviors occurring in the video. Fig. 2 shows the large difference between the 3D-CNN learned with the baseline model and the 3D-CNN learned with the multi-modal combined model of this example, indicating that the multi-modal joint learning of this example leads to better-optimized 3D-CNN training; the main difference is that the model of this example can localize the multiple behaviors presented in each scene. Taking the first row as an example, the attention of the baseline-trained model covers only the "swimming" and "wet" regions, whereas the model of this example attends not only to the "swimming" and "wet" regions but also to "submerging" and "diving"; similar phenomena can be found in the other examples. This example considers that, owing to the joint learning manner of the model, the 3D-CNN benefits well from the multi-modal GCN models and obtains the back-propagated error through the shared spatio-temporal representation, thereby generating stronger and more effective spatio-temporal relationship features to better explore the modality-specific multi-behavior relationships in the video.
This example further attempts to demonstrate the multi-behavior relationships learned by this example. Figs. 3(a), (b) and (c) show the feature changes across the GCN layers through t-distributed stochastic neighbor embedding (t-SNE) visualization, showing that the target behaviors (shaded numbers) gradually aggregate as they pass through the GCN layers, which demonstrates the ability to correlate multiple behaviors. Figs. 3(d) and (e) show the behavior prediction scores of the baseline and of the model of this example, showing that the model of this example can promote multiple target behaviors and suppress non-target behaviors, thereby proving the effectiveness of the latent multi-behavior relationship exploration.
Fig. 4 shows the performance improvement on different behavior categories brought by the multi-modal multi-behavior GCNs and the visual GCN listed in Table 1 of this example. The mAP improvement rate of a model is represented by dividing the difference in mAP between models by the mAP of the target model. This example shows that:
(1) the visual GCN model brings a modest improvement over the baseline, and the improvement is mainly embodied in categories having visual multi-behavior relationships, such as "child talking" (child + talking), "brow" (hearing) and "crying";
(2) the visual-audio model brings significant performance gains over the visual GCN model in categories with audio multi-behavior relationships, e.g., simultaneous "shaking" behaviors can be related through audio;
(3) the visual-text model also helps to identify multiple behaviors with associated literal meanings, such as "open", "close" and "lock";
(4) the audio-text model combines the audio and text multi-behavior relationships, thereby bringing a remarkable improvement;
(5) the full three-modality model improves the performance by integrating the advantages of all three modality-specific multi-behavior relationships, resulting in the highest mAP (see Table 1 of this example).
Table 3 shows a comparison with the state-of-the-art methods on the M-MiT dataset; the model of this example performs best on V1. Since V2 was only released in October 2020, no published comparison results are available, but this example still provides its results for reference. The best three-modality model of this example, using a lighter-weight backbone network, improves the mAP by about 3% over the M-MiT baseline. The M-MiT baseline used SoundNet with a wLSEP loss and behavior-label statistics for audio feature learning, whereas the visual-audio ({v, α}) model of this example performs 2.2% higher in mAP. Another recent work, TIN, reported only the mAP (62.2) on M-MiT (and is therefore not listed in the table), which is lower than that of the method of this example. In fact, the potential of the scheme of this example can be further exploited by using a more powerful 3D-CNN or sampling more input frames; for example, extending 8 frames to 16 frames yields a 0.9% mAP improvement on M-MiT V1.
TABLE 3 comparison of M-MiT V1 and V2
[Table 3 is presented as an image in the original document.]
Furthermore, in this work this example attempts to propose a new approach to multi-modal multi-behavior video understanding, and the newly published M-MiT datasets (V1 in 2019 and V2 in 2020) are well-suited benchmark datasets for this study, since they involve multiple modalities and multiple behaviors as well as their mutual cross-references (e.g., "play music", "drum beat" and "dance"). In addition, this example evaluates the model on the Charades dataset, whose labeling takes little account of audio multi-behavior cross-references (similar to MultiTHUMOS); there this example attempts to combine only the visual and text modalities and still obtains a 2% mAP improvement over the baseline 3D-CNN model.
In summary, the multi-modal multi-behavior relationships in video are explored by utilizing relation GCNs and the multi-modal nature of video. Ablation studies, multi-behavior relationship visualization and improvement analysis all validate the multi-modal multi-behavior GCNs and the multi-modal joint learning of this example, owing to their powerful multi-behavior relationship modeling capability. The method of this example achieves state-of-the-art performance on the latest large-scale multi-behavior M-MiT benchmark dataset.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims (10)

1. A multi-modal joint learning method for video multi-behavior recognition is characterized by comprising the following steps:
s1, constructing a multi-mode joint learning network, wherein the multi-mode joint learning network comprises a visual mode learning module, an audio mode learning network and a text mode learning network;
s2, preprocessing the original video data set to obtain a corresponding visual frame data set, an audio behavior feature dictionary and a text behavior feature dictionary;
and S3, inputting the visual frame data set into a visual modal learning module, inputting the audio behavior feature dictionary into an audio modal learning network, and inputting the text behavior feature dictionary into a text modal learning network for joint training to output multi-modal joint behavior prediction of three modes of joint vision, audio and text.
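By way of a non-limiting illustration of the network constructed in step S1, the following Python (PyTorch) sketch assembles a container module holding the three claimed branches; the class name, the tensor dimensions, and the use of identity/linear layers as stand-ins for the 3D-CNN backbone and the three relational graph convolutional networks are assumptions of this illustration, not part of the claimed method.

import torch.nn as nn

class MultiModalJointNet(nn.Module):
    # Illustrative container for the three branches named in step S1; every layer is a stub.
    def __init__(self, n_behaviors=292, feat_dim=1024, dict_dim=128):
        super().__init__()
        self.visual_backbone = nn.Identity()                 # visual feature extraction network H (stub)
        self.visual_gcn = nn.Linear(feat_dim, n_behaviors)   # visual modality learning network G_v (stub)
        self.audio_gcn = nn.Linear(dict_dim, feat_dim)       # audio modality learning network G_alpha (stub)
        self.text_gcn = nn.Linear(dict_dim, feat_dim)        # text modality learning network G_tau (stub)

net = MultiModalJointNet()
print(net)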
2. The multi-modal joint learning method for video multi-behavior recognition according to claim 1, wherein the visual modality learning module comprises a visual feature extraction network and a visual modality learning network; in step S3, the learning process of the visual modality learning module specifically comprises the steps of:
S31, the visual feature extraction network performs feature extraction on the input visual frame data set, generates spatio-temporal features, and broadcasts the spatio-temporal features to the visual modality learning network as node features of N behaviors; and
S32, the visual modality learning network enhances the node features of the N behaviors, averages the enhanced features over the behavior dimension, and outputs the visual modality behavior prediction.
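The following minimal sketch illustrates steps S31-S32 under illustrative assumptions: hypothetical dimensions, a linear layer standing in for the visual modality learning network G_v, and a random tensor standing in for the pooled spatio-temporal feature of the 3D-CNN backbone.

import torch
import torch.nn as nn

B, N, C = 2, 292, 1024                        # batch size, number of behaviors, feature dim (illustrative)
g_v = nn.Linear(C, N)                         # stand-in for the visual modality learning network G_v

clip_feature = torch.randn(B, C)              # pooled spatio-temporal feature from the 3D-CNN backbone H
x_tilde = clip_feature.unsqueeze(1).expand(-1, N, -1)   # S31: broadcast as node features of N behaviors
enhanced = g_v(x_tilde)                       # S32: enhance the node features -> (B, N, N)
y_v = enhanced.mean(dim=1)                    # S32: average over the behavior dimension
print(y_v.shape)                              # torch.Size([2, 292]) visual modality behavior prediction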
3. The multi-modal joint learning method for video multi-behavior recognition according to claim 2, wherein in step S3, the learning process of the audio modality learning network specifically comprises the steps of:
S33, the audio modality learning network extracts an audio modality multi-behavior relation from the input audio behavior feature dictionary; and
S34, the audio modality multi-behavior relation is applied to the spatio-temporal features generated by the visual feature extraction network, and an audio modality assisted joint behavior prediction is output.
4. The multi-modal joint learning method for video multi-behavior recognition according to claim 3, wherein in step S3, the learning process of the text modality learning network specifically comprises the steps of:
S35, the text modality learning network extracts a text modality multi-behavior relation from the input text behavior feature dictionary; and
S36, the text modality multi-behavior relation is applied to the spatio-temporal features generated by the visual feature extraction network, and a text modality assisted joint behavior prediction is output.
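Steps S33-S36 are symmetric for the audio and text branches; the sketch below illustrates the audio case and assumes, purely as an interpretation, that "applying the relation to the spatio-temporal features" is realized as a matrix product between the GCN output and the visual feature. All names and dimensions are hypothetical.

import torch
import torch.nn as nn

B, N, C, D = 2, 292, 1024, 128                # batch, behaviors, visual feature dim, dictionary embedding dim
g_alpha = nn.Linear(D, C)                     # stand-in for the audio modality learning network G_alpha

x = torch.randn(B, C)                         # spatio-temporal feature X from the visual backbone
x_audio_dict = torch.randn(N, D)              # static audio behavior feature dictionary X_alpha

relation = g_alpha(x_audio_dict)              # S33: audio modality multi-behavior relation, shape (N, C)
y_va = x @ relation.t()                       # S34: apply the relation to X (matrix product, an assumption)
print(y_va.shape)                             # torch.Size([2, 292]) audio modality assisted joint prediction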
5. The multi-modal joint learning method for video multi-behavior recognition according to claim 4, wherein the visual modality learning network, the audio modality learning network and the text modality learning network each employ a relational graph convolutional neural network represented as:
H_\zeta^{(l+1)} = \sigma\big( \tilde{D}_\zeta^{-1/2} \, \tilde{A}_\zeta \, \tilde{D}_\zeta^{-1/2} \, H_\zeta^{(l)} \, W_\zeta^{(l)} \big)
wherein \tilde{A}_\zeta = A_\zeta + I_N is the adjacency matrix of the multi-behavior undirected graph \mathcal{G}_\zeta with added self-connections, I_N being the identity matrix; \tilde{D}_\zeta is the diagonal degree matrix of \tilde{A}_\zeta, with (\tilde{D}_\zeta)_{ii} = \sum_j (\tilde{A}_\zeta)_{ij}; \sigma(\cdot) denotes a nonlinear activation function; W_\zeta^{(l)} is the trainable weight matrix of the l-th layer; H_\zeta^{(l)} denotes the behavior node representations at the l-th layer; \zeta denotes the modality, being the visual modality when \zeta = v, the audio modality when \zeta = \alpha, and the text modality when \zeta = \tau; the multi-behavior undirected graph \mathcal{G}_\zeta is defined as \mathcal{G}_\zeta = (\mathcal{V}, \mathcal{E}), wherein \mathcal{V} = \{v_i\}_{i=1}^{N} is the set of nodes representing behaviors, and \mathcal{E} is the set of edges of co-occurring behaviors represented by a binary adjacency matrix A \in \{0, 1\}^{N \times N}.
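A minimal sketch of the propagation rule reconstructed above, with hypothetical sizes and a randomly generated toy co-occurrence adjacency matrix:

import torch

def gcn_layer(H, A, W, act=torch.relu):
    # One propagation step: H_{l+1} = act(D~^{-1/2} (A + I) D~^{-1/2} H_l W_l)
    N = A.size(0)
    A_tilde = A + torch.eye(N)                           # add self-connections I_N
    d_inv_sqrt = A_tilde.sum(dim=1).pow(-0.5)            # diagonal of D~^{-1/2}
    A_hat = d_inv_sqrt.unsqueeze(1) * A_tilde * d_inv_sqrt.unsqueeze(0)
    return act(A_hat @ H @ W)

N, C_in, C_out = 292, 128, 64
A = (torch.rand(N, N) > 0.95).float()
A = ((A + A.t()) > 0).float()                            # symmetric binary co-occurrence adjacency (toy)
H0 = torch.randn(N, C_in)                                # node features (dictionary or broadcast visual feature)
W0 = 0.01 * torch.randn(C_in, C_out)                     # trainable weight matrix of this layer
print(gcn_layer(H0, A, W0).shape)                        # torch.Size([292, 64])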
6. The multi-modal joint learning method for video multi-behavior recognition according to claim 5, wherein: a conditional probability ψ_ij = ψ(v_j | v_i) is used to represent the probability that behavior v_j occurs when behavior v_i occurs; ψ_ij is calculated from the number of occurrences of the behavior pair (v_j, v_i) and the number of occurrences of behavior v_i in the training set; a threshold t is then set to binarize ψ_ij as an initialization, i.e., A_ij = 1 if ψ_ij > t and A_ij = 0 otherwise, thereby introducing the behavior co-occurrence probabilities as the binary adjacency matrix A.
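The thresholded co-occurrence initialization of claim 6 can be sketched as follows; the threshold value t = 0.4 and the toy annotation matrix are assumptions made only for illustration.

import numpy as np

def build_adjacency(labels, t=0.4):
    # labels: (num_videos, N) binary multi-behavior annotations of the training set.
    # Returns A with A[i, j] = 1 iff psi_ij = P(v_j occurs | v_i occurs) > t.
    labels = labels.astype(np.float64)
    counts_i = labels.sum(axis=0)                        # occurrences of each behavior v_i
    co_counts = labels.T @ labels                        # co-occurrences of behavior pairs (v_i, v_j)
    psi = co_counts / np.maximum(counts_i[:, None], 1.0) # conditional co-occurrence probabilities
    np.fill_diagonal(psi, 0.0)                           # ignore trivial self pairs
    return (psi > t).astype(np.float32)

labels = (np.random.rand(1000, 292) > 0.97).astype(np.float32)   # toy annotations
A = build_adjacency(labels, t=0.4)
print(A.shape, int(A.sum()))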
7. The multi-modal joint learning method for video multi-behavior recognition according to claim 6, wherein the model error for jointly training the multi-modal joint learning network is expressed as:
E = \ell\big(\hat{Y}^{v}, R\big) + \ell\big(\hat{Y}^{v,\alpha}, R\big) + \ell\big(\hat{Y}^{v,\tau}, R\big) + \ell\big(\hat{Y}^{v,\alpha,\tau}, R\big)
wherein R denotes the actual observed values; H denotes the visual feature extraction network; G_v, G_\alpha and G_\tau denote the visual modality learning network, the audio modality learning network and the text modality learning network, respectively; \hat{Y}^{v} denotes the visual modality behavior prediction obtained by the visual feature extraction network in conjunction with the visual modality learning network; \hat{Y}^{v,\alpha} denotes the audio modality assisted joint behavior prediction obtained by the visual feature extraction network in conjunction with the audio modality learning network; \hat{Y}^{v,\tau} denotes the text modality assisted joint behavior prediction obtained by the visual feature extraction network in conjunction with the text modality learning network; \hat{Y}^{v,\alpha,\tau} denotes the multi-modal joint behavior prediction of the multi-modal joint learning network; and \ell(\cdot,\cdot) denotes the loss function;
in the joint training process, the modality-specific relation representations first receive the error gradients, which update the weights of the three relational graph convolutional neural networks so as to minimize the loss; the errors are then propagated from the three relational graph convolutional neural networks, through the shared spatio-temporal representation, back to the visual feature extraction network to adjust its weights accordingly; in this way, the multi-modal joint learning network is trained through multiple modalities in a joint learning manner, which forces the relational graph convolutional neural networks to learn more accurate relation predictions from the spatio-temporal features and the visual feature extraction network to model stronger and more relevant spatio-temporal features from the video.
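A hedged sketch of the joint training error as a sum of per-branch losses; the use of binary cross-entropy with logits is an assumption, since the claim only specifies a generic loss function \ell.

import torch
import torch.nn.functional as F

def joint_loss(y_v, y_va, y_vt, y_vat, targets):
    # Sum of the per-branch losses of the four predictions against the observed labels R.
    loss = 0.0
    for pred in (y_v, y_va, y_vt, y_vat):
        loss = loss + F.binary_cross_entropy_with_logits(pred, targets)
    return loss

B, N = 2, 292
targets = (torch.rand(B, N) > 0.97).float()              # toy multi-behavior ground truth R
preds = [torch.randn(B, N, requires_grad=True) for _ in range(4)]
loss = joint_loss(*preds, targets)
loss.backward()                                          # gradients reach every branch, as in the joint training
print(float(loss))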
8. The multi-modal joint learning method for video multi-behavior recognition according to claim 7, wherein the final behavior prediction generated by the multi-modal joint learning network is expressed as:
\hat{Y}^{v,\alpha,\tau} = G_v(\tilde{X}) + G_\alpha(X_\alpha)\,X + G_\tau(X_\tau)\,X
wherein X denotes the dynamic spatio-temporal features output by the visual feature extraction network; \tilde{X} denotes the broadcast of X in the feature dimension; X_\alpha denotes the static audio behavior feature dictionary; X_\tau denotes the static text behavior feature dictionary; G_v(\tilde{X}) denotes the prediction of the visual modality learning network for the input \tilde{X}; G_\alpha(X_\alpha) denotes the prediction of the audio modality learning network for the input X_\alpha; and G_\tau(X_\tau) denotes the prediction of the text modality learning network for the input X_\tau.
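The fused prediction can be sketched end-to-end as below; the additive fusion, the matrix-product application of the audio/text relations, and all layer stand-ins are assumptions made for illustration only (the combination formula above is itself a reconstruction).

import torch
import torch.nn as nn

B, N, C, D = 2, 292, 1024, 128
g_v, g_alpha, g_tau = nn.Linear(C, N), nn.Linear(D, C), nn.Linear(D, C)   # stand-ins for the three GCNs

x = torch.randn(B, C)                           # dynamic spatio-temporal feature X from H
x_tilde = x.unsqueeze(1).expand(-1, N, -1)      # broadcast of X as node features
x_a, x_t = torch.randn(N, D), torch.randn(N, D) # static audio / text behavior feature dictionaries

y_v = g_v(x_tilde).mean(dim=1)                  # visual branch prediction
y_va = x @ g_alpha(x_a).t()                     # audio relation applied to X (assumed matrix product)
y_vt = x @ g_tau(x_t).t()                       # text relation applied to X (assumed matrix product)
y_vat = y_v + y_va + y_vt                       # assumed additive fusion of the three terms
print(y_vat.shape)                              # torch.Size([2, 292])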
9. The multi-modal joint learning method for video multi-behavior recognition according to claim 8, wherein: the audio behavior feature dictionary and the text behavior feature dictionary are each defined as a set L of pairs (f, s), where the form f is an embedded feature of finite dimension and the meaning s is the corresponding behavior in a given behavior set; a feature corresponding to multiple behaviors is polysemous, and multiple features belonging to one behavior are synonyms; the audio and text feature dictionaries are denoted as the sets L_α and L_τ, respectively, wherein the audio and text embedded features f_α and f_τ are the corresponding forms and the behaviors s are the meanings;
the behavior node features of the audio modality learning network and the text modality learning network are initialized by querying the corresponding dictionary; the node features are modeled by traversing all meanings and querying the forms of their synonyms from the dictionary, so that the audio modality learning network and the text modality learning network can infer the semantic relations among all the behaviors modeled as node features.
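A sketch of initializing behavior node features from a (form, meaning) dictionary by averaging the forms of synonyms that share a meaning; the averaging rule, the toy embeddings, and all names are assumptions of this illustration.

import torch

def init_node_features(dictionary, behaviors, dim):
    # dictionary: list of (form, meaning) pairs; form is a 1-D embedding, meaning is a behavior name.
    # Each behavior node is the mean of the forms of all its synonyms (features sharing that meaning).
    nodes = []
    for b in behaviors:
        forms = [f for f, s in dictionary if s == b]
        nodes.append(torch.stack(forms).mean(dim=0) if forms else torch.zeros(dim))
    return torch.stack(nodes)                    # (N, dim) node features for the modality GCN

dim = 16
behaviors = ["open", "close", "lock"]
dictionary = [(torch.randn(dim), "open"),        # two forms with the same meaning act as synonyms
              (torch.randn(dim), "open"),
              (torch.randn(dim), "close")]
print(init_node_features(dictionary, behaviors, dim).shape)   # torch.Size([3, 16])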
10. The multi-modal joint learning method for video multi-behavior recognition according to any one of claims 5-9, wherein: the visual modality learning network, the audio modality learning network and the text modality learning network each adopt a relational graph convolutional neural network with a two-layer structure.
CN202111143894.6A 2021-09-28 2021-09-28 Multi-mode joint learning method for video multi-behavior recognition Active CN113807307B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111143894.6A CN113807307B (en) 2021-09-28 2021-09-28 Multi-mode joint learning method for video multi-behavior recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111143894.6A CN113807307B (en) 2021-09-28 2021-09-28 Multi-mode joint learning method for video multi-behavior recognition

Publications (2)

Publication Number Publication Date
CN113807307A true CN113807307A (en) 2021-12-17
CN113807307B CN113807307B (en) 2023-12-12

Family

ID=78938904

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111143894.6A Active CN113807307B (en) 2021-09-28 2021-09-28 Multi-mode joint learning method for video multi-behavior recognition

Country Status (1)

Country Link
CN (1) CN113807307B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114299551A (en) * 2022-03-07 2022-04-08 深圳市海清视讯科技有限公司 Model training method, animal behavior identification method, device and equipment
CN117690098A (en) * 2024-02-01 2024-03-12 南京信息工程大学 Multi-label identification method based on dynamic graph convolution under open driving scene

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111246256A (en) * 2020-02-21 2020-06-05 华南理工大学 Video recommendation method based on multi-mode video content and multi-task learning
US20200356858A1 (en) * 2019-05-10 2020-11-12 Royal Bank Of Canada System and method for machine learning architecture with privacy-preserving node embeddings
CN113051927A (en) * 2021-03-11 2021-06-29 天津大学 Social network emergency detection method based on multi-modal graph convolutional neural network

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200356858A1 (en) * 2019-05-10 2020-11-12 Royal Bank Of Canada System and method for machine learning architecture with privacy-preserving node embeddings
CN111246256A (en) * 2020-02-21 2020-06-05 华南理工大学 Video recommendation method based on multi-mode video content and multi-task learning
CN113051927A (en) * 2021-03-11 2021-06-29 天津大学 Social network emergency detection method based on multi-modal graph convolutional neural network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
QI, JINSHAN; LIANG, XUN; LI, ZHIYU; CHEN, YANFANG; XU, YUAN: "Representation Learning of Large-Scale Complex Information Networks: Concepts, Methods and Challenges", 计算机学报 (Chinese Journal of Computers), no. 10, pages 222-248 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114299551A (en) * 2022-03-07 2022-04-08 深圳市海清视讯科技有限公司 Model training method, animal behavior identification method, device and equipment
CN117690098A (en) * 2024-02-01 2024-03-12 南京信息工程大学 Multi-label identification method based on dynamic graph convolution under open driving scene
CN117690098B (en) * 2024-02-01 2024-04-30 南京信息工程大学 Multi-label identification method based on dynamic graph convolution under open driving scene

Also Published As

Publication number Publication date
CN113807307B (en) 2023-12-12

Similar Documents

Publication Publication Date Title
CN108319686B (en) Antagonism cross-media retrieval method based on limited text space
CN107679580B (en) Heterogeneous migration image emotion polarity analysis method based on multi-mode depth potential correlation
Tang et al. Graph-based multimodal sequential embedding for sign language translation
CN112733533A (en) Multi-mode named entity recognition method based on BERT model and text-image relation propagation
CN113297370B (en) End-to-end multi-modal question-answering method and system based on multi-interaction attention
Zhang et al. Semantic sentence embeddings for paraphrasing and text summarization
CN113807307B (en) Multi-mode joint learning method for video multi-behavior recognition
Barsever et al. Building a better lie detector with BERT: The difference between truth and lies
Ohishi et al. Trilingual semantic embeddings of visually grounded speech with self-attention mechanisms
CN114528411A (en) Automatic construction method, device and medium for Chinese medicine knowledge graph
CN113076483A (en) Case element heteromorphic graph-based public opinion news extraction type summarization method
CN112733764A (en) Method for recognizing video emotion information based on multiple modes
CN116561305A (en) False news detection method based on multiple modes and transformers
CN114969458A (en) Hierarchical self-adaptive fusion multi-modal emotion analysis method based on text guidance
CN115129934A (en) Multi-mode video understanding method
Huang et al. An effective multimodal representation and fusion method for multimodal intent recognition
CN114022687A (en) Image description countermeasure generation method based on reinforcement learning
CN113627550A (en) Image-text emotion analysis method based on multi-mode fusion
CN116933051A (en) Multi-mode emotion recognition method and system for modal missing scene
CN116977701A (en) Video classification model training method, video classification method and device
Pandey et al. Attention-based Model for Multi-modal sentiment recognition using Text-Image Pairs
CN113806545B (en) Comment text emotion classification method based on label description generation
CN115169472A (en) Music matching method and device for multimedia data and computer equipment
CN114693949A (en) Multi-modal evaluation object extraction method based on regional perception alignment network
Liu et al. Attention-based convolutional LSTM for describing video

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant