CN114863356B - Group activity identification method and system based on residual aggregation graph network - Google Patents

Group activity identification method and system based on residual aggregation graph network Download PDF

Info

Publication number
CN114863356B
CN114863356B · CN202210236706.2A · CN202210236706A
Authority
CN
China
Prior art keywords
individual
representing
appearance
difference
branch
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210236706.2A
Other languages
Chinese (zh)
Other versions
CN114863356A (en)
Inventor
李威
吴晓
杨添朝
张基
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southwest Jiaotong University
Original Assignee
Southwest Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southwest Jiaotong University filed Critical Southwest Jiaotong University
Priority to CN202210236706.2A priority Critical patent/CN114863356B/en
Publication of CN114863356A publication Critical patent/CN114863356A/en
Application granted granted Critical
Publication of CN114863356B publication Critical patent/CN114863356B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F18/241 — Pattern recognition; Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 — Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F18/253 — Fusion techniques of extracted features
    • G06N3/045 — Neural networks; Architecture, e.g. interconnection topology; Combinations of networks
    • G06N3/047 — Probabilistic or stochastic networks
    • G06N3/048 — Activation functions
    • G06N3/08 — Neural networks; Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the technical field of data identification, in particular to the technical field of intelligent video image analysis, and discloses a group activity identification method and system based on a residual aggregation graph network. The method comprises the following steps: S1, appearance feature extraction; S2, two-branch reasoning; S3, weighted fusion; and S4, group activity prediction. The invention solves problems of the prior art such as the difficulty of effectively distinguishing video clips with similar individual actions but different group activity categories, and the lack of importance screening of different semantic features.

Description

Group activity identification method and system based on residual aggregation graph network
Technical Field
The invention relates to the technical field of data identification, in particular to the technical field of intelligent video image analysis, and specifically relates to a group activity identification method and system based on a residual aggregation graph network.
Background
With the growth of urban populations and the dramatic increase in pedestrian flow in public areas, crowd monitoring and management face great challenges and pressure. If video surveillance technology can be used to detect and raise alarms for abnormal group behaviour in key areas in time, the relevant departments can take corresponding measures against the early warning or alarm in the shortest time, minimizing the probability of safety accidents and the losses they cause. Therefore, more and more video surveillance systems are deployed in public places to maintain public order and improve security in public areas, and group activity analysis is receiving more and more attention.
The difficulty of group activity detection is that, in addition to grasping the actions of individuals, the potential connections between individuals in the group must also be grasped. Therefore, to better recognize group activities it is critical to exploit various kinds of information, such as appearance information, spatial location information, similarity relationship information and difference information.
At present, group activity recognition is mostly addressed by methods based on graph neural networks. Such methods mainly comprise the following steps:
(1) Extract the appearance features of each individual in several representative frames of the corresponding video clip through a backbone network.
(2) Capture the correlations among individuals in the group in the form of a graph, and extract relation features by graph convolution.
(3) Perform simple element-wise addition fusion and pooling on the individual appearance features and the relation features to obtain a video feature representing the whole video clip.
(4) Feed the video feature to a classifier to obtain the corresponding group activity classification result.
Such graph-neural-network-based methods, firstly, ignore the difference information that exists between the actors in the video (such as the small differences between similar actions), which is very important for effectively distinguishing video clips whose individual actions are similar but whose group activity categories differ; secondly, the low-order appearance features and the high-order reasoning features are fused with equal weights, and this fusion strategy lacks importance screening of the different semantic features.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a group activity identification method and system based on a residual aggregation graph network, which solve the problems in the prior art that video clips with similar individual actions but different group activity categories are difficult to distinguish effectively and that importance screening of different semantic features is lacking.
The technical scheme adopted by the invention for solving the problems is as follows:
a group activity identification method based on a residual aggregation graph network comprises the following steps:
s1, appearance feature extraction: obtaining the appearance characteristics X of the individual level of the group to be identified by utilizing the key frames of the given video clip and the corresponding bounding boxes of each individual,
Figure GDA0003940402580000021
x i ∈R D wherein T represents the frame number of the key frame of the video clip, and N represents the individual number of each key frame in the video clip; i denotes the number of individuals in the video clip key-frame, i =1,2, \8230;, T × N; x is a radical of a fluorine atom i Representing the appearance characteristics of an individual with the number i in a video clip, wherein R represents a linear space in a real number domain, and D represents the dimension of the appearance characteristics of the linear space R;
s2, double-branch reasoning: performing difference relation reasoning on the appearance characteristics based on residual error aggregation to obtain difference characteristics
Figure GDA0003940402580000022
And respectively carrying out similarity relation reasoning on the appearance characteristics based on a graph neural network to obtain relation characteristics X';
s3, weighted fusion: weighting and fusing the appearance characteristics, the difference characteristics and the relation characteristics in the channel direction to obtainWeighted features
Figure GDA0003940402580000031
S4, predicting group activities: and performing pooling operation on the weighted features to obtain a global representation representing the whole video segment, further processing to obtain the confidence of each frame to the group activity category, and predicting the group activity category by using the average value of the confidence of each category of each frame.
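Purely for illustration, the four steps above could be wired together as in the following sketch; the function and module names are placeholders introduced here rather than part of the invention, and the spatial positions of the bounding boxes, which both reasoning branches also consume, are omitted for brevity.

def recognize_group_activity(frames, boxes, extractor, lr2m, graph_branch, fusion, head):
    # Hedged sketch of steps S1-S4; every module is assumed to be a callable built elsewhere.
    X = extractor(frames, boxes)        # S1: individual-level appearance features, shape (T*N, D)
    X_diff = lr2m(X)                    # S2a: difference features via local residual aggregation
    X_rel = graph_branch(X)             # S2b: relation features via graph-based similarity reasoning
    X_w = fusion([X, X_diff, X_rel])    # S3: channel-wise weighted fusion of the three semantic features
    return head(X_w)                    # S4: pooling + classifier -> predicted group activity category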
As a preferred technical solution, in step S2, the formula for performing residual-aggregation-based difference relation reasoning on the appearance features to obtain the difference features is:

x̂_j = Σ_{i=1}^{T×N} m_{j,i} · r_i(x_j) · (x_j − x_i);

where j denotes the index of an individual in the video clip, j = 1, 2, ..., T×N; x̂_j denotes the difference feature of the j-th individual and is an element of the difference features X̂, X̂ = {x̂_j | j = 1, 2, ..., T×N}; r_i(x_j) denotes the residual relationship between individual j and individual i; m_{j,i} denotes the spatial position correlation between individual j and individual i; and x_j − x_i denotes the difference in appearance features between individual j and individual i at different spatio-temporal positions.
As a preferred technical solution, the calculation formula of r_i(x_j) is:

r_i(x_j) = σ(w_j^T (x_j − x_i) + b_j);

where σ(·) denotes a sigmoid-type smoothing function, w_j denotes a weight that maps the appearance difference between two individuals centred on individual j to a scalar, b_j denotes the corresponding scalar offset, w_j ∈ R^D, and b_j ∈ R^1.
As a preferred technical solution, the calculation formula of the spatial position correlation m_{j,i} is:

m_{j,i} = Π(d(x_j, x_i) ≤ μ);

where Π(·) denotes an indicator function, d(x_j, x_i) denotes the Euclidean distance between individual j and individual i, and μ denotes a spatial restriction factor.
As a preferred technical solution, in step S2, the formula for performing graph-neural-network-based similarity relation reasoning on the appearance features to obtain the relation features is:

X' = Σ_{g=1}^{N_g} ReLU(G^g X W^g);

where g denotes the index of a relation graph, N_g denotes the number of graphs, ReLU(·) denotes the ReLU activation function, G^g denotes the relation graph with index g, and W^g denotes the weight matrix of the linear transformation corresponding to the relation graph with index g.
As a preferred technical solution, the calculation formula of G^g is:

G^g = {G_{i,j} ∈ R^1 | i, j = 1, ..., T×N},
G_{i,j} = m_{i,j} · exp(f_a(x_i, x_j)) / Σ_{j=1}^{T×N} m_{i,j} · exp(f_a(x_i, x_j));

where G_{i,j} denotes the magnitude of the similarity relationship between individual i and individual j, m_{i,j} denotes the spatial position correlation between individual i and individual j, and f_a(x_i, x_j) denotes the appearance correlation between individual i and individual j.
As a preferred technical solution, the calculation formula of f_a(x_i, x_j) is:

f_a(x_i, x_j) = θ(x_i)^T φ(x_j) / √(d_k);

where θ(x_i) denotes a linear transformation embedding the D-dimensional appearance feature x_i of individual i into a d_k-dimensional space, φ(x_j) denotes a linear transformation embedding the D-dimensional appearance feature x_j of individual j into a d_k-dimensional space, ^T denotes the transpose operation, and d_k denotes a normalization factor.
As a preferred technical solution, the step S3 comprises the following steps:
S31, add the appearance features, the difference features and the relation features element-wise to obtain integrated features, calculated as:

F = Σ_{b=1}^{N_b} X^b;

where F denotes the integrated features, F ∈ R^{T×N×D}; b is the branch index; N_b denotes the number of branches, N_b = 3 in the above formula; and X^b denotes the appearance features, the difference features or the relation features, i.e. the different semantic features X, X̂ and X';
s32, embedding global information in the channel direction by using a global average pooling layer and a full connection layer to generate channel statistical information, wherein the calculation mode is as follows:
Figure GDA0003940402580000064
wherein S represents channel statistics, W s Representing learnable parameters for linear transformation of pooled features,
Figure GDA0003940402580000071
f (n, t:) represents the characteristics of the nth individual of the t frame in F in the channel direction;
s33, obtaining the weights of different branch characteristics in the channel direction through the operation of the full connection layer and the softmax, wherein the calculation mode is as follows:
W b =softmax(w b S);
wherein, W b Weight vector of features branched into b, w b For the learnable linear transformation parameters of branch b that map the channel statistics S to weight vectors,
Figure GDA0003940402580000072
s34, calculating each one-dimensional feature in the channel direction of the weighted and fused features in the following way:
Figure GDA0003940402580000073
wherein,
Figure GDA0003940402580000074
representing weighted features
Figure GDA0003940402580000075
The value of the c-th channel of (c),
Figure GDA0003940402580000076
representing the weight of the feature in the c-th channel dimension of the b-th branch,
Figure GDA0003940402580000081
is W b The value of the c-th element;
Figure GDA0003940402580000082
a value representing a characteristic in the channel dimension of the c-th branch of the b-th branch,
Figure GDA0003940402580000083
is X b The value of the c-th element.
As a preferred technical solution, in step S4, the global representation is passed through a full connection layer and then through softmax operation, so as to obtain the confidence of each frame with respect to the group activity category.
The system applied to the group activity recognition method based on the residual aggregation graph network comprises an appearance feature extraction module, a two-branch reasoning module, a weighted fusion module and a group activity prediction module which are electrically connected in sequence, the appearance feature extraction module also being electrically connected with the weighted fusion module;
wherein:
Appearance feature extraction module: used for obtaining the individual-level appearance features X of the group to be identified from the key frames of a given video clip and the bounding box of each individual, X = {x_i | i = 1, 2, ..., T×N}, x_i ∈ R^D, where T denotes the number of key frames of the video clip, N denotes the number of individuals in each key frame of the video clip, i denotes the index of an individual in the video clip key frames, i = 1, 2, ..., T×N, x_i denotes the appearance feature of the individual numbered i in the video clip, R denotes a linear space over the real number field, and D denotes the dimension of the appearance feature in the linear space R;
Two-branch reasoning module: used for performing residual-aggregation-based difference relation reasoning on the appearance features to obtain the difference features X̂, and performing graph-neural-network-based similarity relation reasoning on the appearance features to obtain the relation features X';
Weighted fusion module: used for performing weighted fusion of the appearance features, the difference features and the relation features in the channel direction to obtain the weighted features X̃;
Group activity prediction module: used for performing a pooling operation on the weighted features to obtain a global representation of the whole video clip, processing it further to obtain the confidence of each frame for each group activity category, and predicting the group activity category using the average of the per-category confidences over the frames.
Compared with the prior art, the invention has the following beneficial effects:
(1) The method utilizes the difference information among the actors in the video, which is very important for effectively distinguishing the video segments which have similar individual actions but different group activity categories; the method for capturing the potential useful difference information in the group and the method for fusing different semantic features by self-adaptive weighting greatly improve the accuracy of group activity detection;
(2) The local residual error aggregation network module provided by the invention can encode potential differences among all related actors in a crowd and provide additional clues for reasoning;
(3) The weighted fusion strategy provided by the invention can adaptively select more important information in different semantic features.
Drawings
FIG. 1 is a schematic diagram illustrating steps of a group activity recognition method based on a residual aggregation graph network according to the present invention;
FIG. 2 is a block diagram of a group activity recognition system based on a residual aggregation graph network according to the present invention;
FIG. 3 is a schematic view of a workflow of a group activity recognition system based on a residual aggregation graph network according to the present invention;
FIG. 4 is a schematic diagram of a local residual aggregation network module according to the present invention;
FIG. 5 is a schematic structural diagram of a weighted fusion module according to the present invention;
FIG. 6 is a flow chart of model training according to the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited to these examples.
Example 1
As shown in fig. 1 to 6, the present invention provides a group activity recognition method and a model training method, which can detect an action category of a group from a video.
Detecting the action of an individual in a video requires extracting that individual's appearance and spatio-temporal information. In a real scene each individual has its own action behaviour, so, compared with individual action detection, group activity recognition additionally requires extracting the appearance and spatio-temporal information of every individual and modelling the latent information among individuals in order to infer the potential relationships within the group.
Most graph-neural-network-based group activity recognition methods, firstly, ignore the difference information that exists between the actors in the video (such as the small differences between similar actions); secondly, they fuse the low-order appearance features and the high-order relation features with equal weights, and this fusion strategy lacks importance screening of the different semantic features. Finding a way to capture the potentially useful difference information within a group, and a method for adaptively weighting and fusing different semantic features, is therefore very important for improving the accuracy of group activity detection.
In order to solve the above problems, the present invention provides a group activity identification method and system based on a residual aggregation graph network. A local residual aggregation network module is used to capture the potentially useful difference information within a group, and is combined with a graph-neural-network-based relation reasoning module to form two-branch reasoning. A weighted fusion module is used to adaptively weight and fuse the different semantic features so as to screen out the important information.
The main technical solution of the present invention will be described in detail with reference to specific examples.
1. The group activity detection method based on the graph neural network and the residual aggregation network comprises the following steps:
1. Basic appearance feature extraction:
The key frames of a given video clip and the corresponding bounding box of each individual are processed with a backbone network and RoIAlign to obtain individual-level appearance features, expressed as X = {x_i | i = 1, 2, ..., T×N}, where x_i denotes the appearance feature of each individual, x_i ∈ R^D.
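A possible realization of this step with off-the-shelf components is sketched below; the VGG16 backbone, the 5×5 RoIAlign crop and the 1024-dimensional embedding are illustrative assumptions rather than values fixed by the invention.

import torch
import torchvision
from torchvision.ops import roi_align

class AppearanceExtractor(torch.nn.Module):
    """Sketch of step 1: backbone + RoIAlign -> individual-level appearance features X."""
    def __init__(self, feat_channels=512, crop=5, out_dim=1024):
        super().__init__()
        self.backbone = torchvision.models.vgg16(weights=None).features  # illustrative backbone choice
        self.embed = torch.nn.Linear(feat_channels * crop * crop, out_dim)
        self.crop = crop

    def forward(self, frames, boxes_per_frame):
        # frames: (T, 3, H, W); boxes_per_frame: list of T tensors, each (N, 4) as (x1, y1, x2, y2)
        fmap = self.backbone(frames)                              # (T, 512, H', W')
        scale = fmap.shape[-1] / frames.shape[-1]                 # image -> feature-map scale
        crops = roi_align(fmap, boxes_per_frame, (self.crop, self.crop), spatial_scale=scale)
        return self.embed(crops.flatten(1))                       # appearance features X, shape (T*N, D)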
2. Two-branch reasoning network module:
The obtained appearance features X are fed into a local residual aggregation network module and a graph-neural-network-based relation reasoning module respectively to carry out two-branch high-order reasoning. These two networks are described separately below.
(1) Local residual aggregation network module (LR²M):
The obtained appearance features are fed into LR²M, which models the difference information between the individuals in the group to obtain the difference features X̂ = {x̂_j | j = 1, 2, ..., T×N}, where x̂_j is one of the elements of the difference features X̂ and denotes the difference feature of the j-th individual.
x̂_j is calculated as follows:

x̂_j = Σ_{i=1}^{T×N} m_{j,i} · r_i(x_j) · (x_j − x_i);

where x_j − x_i denotes the appearance difference between the j-th individual and the i-th individual; r_i(x_j) denotes the residual relationship between individual j and individual i, with r_i(x_j) = 1 when there is a useful difference relationship between (j, i) and r_i(x_j) = 0 when there is no useful difference relationship between (j, i); and m_{j,i} is a spatial limitation.
To make the above equation differentiable so that gradients can be back-propagated during network training, r_i(x_j) is smoothed; the smoothed r_i(x_j) is calculated as follows:

r_i(x_j) = σ(w_j^T (x_j − x_i) + b_j);

where σ(·) denotes the sigmoid function, w_j denotes a weight that maps the appearance difference between two individuals centred on individual j to a scalar, b_j denotes the corresponding scalar offset, w_j ∈ R^D, and b_j ∈ R^1.
Previous experiments have demonstrated that local information is more conducive to reasoning about the group activity category. Therefore, the local residual relations are aggregated using a distance mask, calculated as follows:

m_{j,i} = Π(d(x_j, x_i) ≤ μ);

where Π(·) is an indicator function, μ is the spatial limiting factor, a hyper-parameter whose specific value is chosen according to the situation, and d(x_j, x_i) is the Euclidean distance between individual j and individual i. When d(x_j, x_i) ≤ μ, m_{j,i} = 1; otherwise m_{j,i} = 0.
Finally, the difference feature x̂_j is calculated as follows:

x̂_j = Σ_{i=1}^{T×N} Π(d(x_j, x_i) ≤ μ) · σ(w_j^T (x_j − x_i) + b_j) · (x_j − x_i).

By modeling the local differences between the actors in the group in this way, useful difference information in the video clip is obtained for inferring the group activity.
(2) Graph-neural-network-based relation reasoning module:
For relation modeling, a graph neural network is adopted to build an actor relation graph, and the graph structure provides relational information between actors for group activity recognition.
Each node in the actor relation graph represents an individual, while the importance of the relationship between two individuals is represented by the weight of the edge between the two actors. The weight of each edge in the graph is determined by the appearance features and spatial positions of the individuals at its two ends. The weight of the edge between two individual nodes is calculated as follows:
G_{i,j} = m_{i,j} · exp(f_a(x_i, x_j)) / Σ_{j=1}^{T×N} m_{i,j} · exp(f_a(x_i, x_j));

where G_{i,j} denotes the magnitude of the similarity relationship between individual i and individual j, m_{i,j} denotes the spatial position correlation between individual i and individual j, and f_a(x_i, x_j) denotes the appearance correlation between individual i and individual j.
For the appearance correlation, an embedded dot product is adopted, with the corresponding formula:

f_a(x_i, x_j) = θ(x_i)^T φ(x_j) / √(d_k);

where θ(x_i) denotes a linear transformation embedding the D-dimensional appearance feature x_i of individual i into a d_k-dimensional space, φ(x_j) denotes a linear transformation embedding the D-dimensional appearance feature x_j of individual j into a d_k-dimensional space, ^T denotes the transpose operation, and d_k denotes a normalization factor, which is a constant.
For the spatial position correlation, the same distance masking approach as in the local residual aggregation network above is adopted:

m_{i,j} = Π(d(x_i, x_j) ≤ μ);

where Π(·) denotes an indicator function, d(x_i, x_j) denotes the Euclidean distance between individual i and individual j, and μ denotes the spatial limiting factor.
Thus, the relation graph between the actors in the group can be expressed as:

G^g = {G_{i,j} ∈ R^1 | i, j = 1, ..., T×N}.
and meanwhile, a plurality of relation graphs are constructed to capture different related information. In the present invention, a series of graphs are built
Figure GDA0003940402580000147
Each graph is computed separately and does not share weights. The multiple graphs can be established, and the model can be operated to merge and learn different types of relationship information, so that the model can make more reliable relationship reasoning.
After graph construction, single-layer relational reasoning is implemented with a GCN. The relation features are calculated as follows:

X' = Σ_{g=1}^{N_g} ReLU(G^g X W^g);

where g denotes the index of a relation graph, N_g denotes the number of graphs, ReLU(·) denotes the ReLU activation function, G^g denotes the relation graph with index g, and W^g denotes the weight matrix of the linear transformation corresponding to the relation graph with index g.
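A sketch of one possible graph branch is given below; it follows the embedded dot-product affinity, the distance mask and the single-layer graph convolution described above, while the row-wise softmax normalization of the graph, the embedding dimension and the number of graphs are assumptions.

import torch

class RelationGraphReasoning(torch.nn.Module):
    """Sketch of the graph-based similarity reasoning branch (N_g relation graphs + GCN)."""
    def __init__(self, dim, embed_dim=256, num_graphs=4, mu=0.2):
        super().__init__()
        self.theta = torch.nn.ModuleList([torch.nn.Linear(dim, embed_dim) for _ in range(num_graphs)])
        self.phi = torch.nn.ModuleList([torch.nn.Linear(dim, embed_dim) for _ in range(num_graphs)])
        self.W = torch.nn.ModuleList([torch.nn.Linear(dim, dim, bias=False) for _ in range(num_graphs)])
        self.mu, self.d_k = mu, embed_dim

    def forward(self, X, centers):
        # X: (K, D) appearance features; centers: (K, 2) individual centre positions
        mask = (torch.cdist(centers, centers) <= self.mu).float()      # spatial position correlation
        out = 0
        for theta, phi, W in zip(self.theta, self.phi, self.W):
            f_a = theta(X) @ phi(X).T / self.d_k ** 0.5                 # embedded dot-product affinity
            G = torch.softmax(f_a.masked_fill(mask == 0, float('-inf')), dim=-1)  # normalized relation graph
            out = out + torch.relu(G @ W(X))                            # single-layer graph convolution
        return out                                                      # relation features X', shape (K, D)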
The difference features X̂ and the relation features X' are obtained through the local residual aggregation network module and the graph-neural-network-based relation reasoning module respectively, forming a two-branch reasoning network.
3. Weighted fusion module (WAS):
The appearance features X, the difference features X̂ and the relation features X' are fed into the adaptive weighted fusion module, and the three different semantic features are weighted and fused in the channel direction. The specific method comprises the following steps:
(1) First, the information of all branch features is integrated; the integrated features are obtained by simple element-wise addition, calculated as:

F = Σ_{b=1}^{N_b} X^b;

where F denotes the integrated features, F ∈ R^{T×N×D}; b is the branch index; N_b denotes the number of branches, N_b = 3 in the above formula; and X^b denotes the appearance features, the difference features or the relation features, i.e. the different semantic features X, X̂ and X'.
(2) Global information is embedded in the channel direction using global average pooling and a fully connected layer to generate the channel statistics, calculated as follows:

S = W_s · ((1/(T×N)) Σ_{t=1}^{T} Σ_{n=1}^{N} F(n, t, :));

where S denotes the channel statistics, W_s denotes the learnable parameter of the linear transformation applied to the pooled features, and F(n, t, :) denotes the feature of the n-th individual of the t-th frame in F along the channel direction.
(3) The weights of the different branch features in the channel direction are obtained through simple fully connected and softmax operations, calculated as follows:

W_b = softmax(w_b S);

where W_b denotes the weight vector of the features of branch b, and w_b denotes the learnable linear transformation parameter of branch b that maps the channel statistics S to the weight vector.
(4) Finally, each one-dimensional feature of the weighted-fused features in the channel direction is calculated as:

X̃(c) = Σ_{b=1}^{N_b} W_b(c) · X^b(c);

where X̃(c) denotes the value of the c-th channel of the weighted features X̃, W_b(c) denotes the weight of the b-th branch features in the c-th channel dimension, i.e. the c-th element of W_b, and X^b(c) denotes the value of the b-th branch features in the c-th channel dimension, i.e. the c-th element of X^b.
Through the WAS, the features under different semantics are subjected to adaptive weighted fusion, and information more important for group activity identification is screened out.
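The adaptive weighting described above could be realized along the following lines; the placement of the dimension reduction (shrinkage factor) inside the statistics layer and the pooling over all individuals and frames are assumptions consistent with, but not dictated by, the text.

import torch

class WeightedFusion(torch.nn.Module):
    """Sketch of the WAS module: channel-wise softmax weighting of B semantic branches."""
    def __init__(self, dim, num_branches=3, reduction=16):
        super().__init__()
        self.squeeze = torch.nn.Linear(dim, dim // reduction)                      # W_s: channel statistics S
        self.expand = torch.nn.ModuleList(
            [torch.nn.Linear(dim // reduction, dim) for _ in range(num_branches)])  # w_b for each branch

    def forward(self, branches):
        # branches: list of B tensors, each (K, D): appearance, difference and relation features
        F = torch.stack(branches).sum(dim=0)                  # element-wise integration, (K, D)
        S = self.squeeze(F.mean(dim=0))                       # global average pooling + FC, (D/r,)
        logits = torch.stack([fc(S) for fc in self.expand])   # (B, D)
        W = torch.softmax(logits, dim=0)                      # per-channel weights across branches
        X = torch.stack(branches)                             # (B, K, D)
        return (W.unsqueeze(1) * X).sum(dim=0)                # weighted features, (K, D)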
4. A pooling operation is applied to the weighted features X̃ obtained above, resulting in a global representation of the entire video clip. The global representation is passed through a simple fully connected layer and then a softmax operation to obtain the confidence of each frame for each group activity category, and the group activity category is predicted using the average of the per-category confidences over the frames.
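A sketch of this prediction step follows; the max pooling over individuals and the final averaging mirror the text above, while the pooling type itself is an assumption.

import torch

def predict_activity(X_w, classifier, T, N):
    """X_w: (T*N, D) weighted features; classifier: e.g. torch.nn.Linear(D, num_classes)."""
    frame_repr = X_w.view(T, N, -1).max(dim=1).values      # pool over individuals per key frame (pool type assumed)
    conf = torch.softmax(classifier(frame_repr), dim=-1)   # confidence of each frame for each activity category
    return conf.mean(dim=0).argmax().item()                # average per-category confidence over frames, take argmax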
2. The model training method for group activity detection based on the graph neural network and the residual aggregation network comprises the following steps:
1. and acquiring a video clip sample and a label corresponding to the sample, wherein the label represents the group activity of each key frame in the training sample.
2. Dividing the sample and the label thereof into two parts according to a proportion, wherein one part is a training set and is used for training the model; and a part is a verification set used for selecting the model.
3. And pre-training the backbone network by using the processed training set.
4. And (3) processing the samples in the training set, outputting a prediction result through a model, and calculating the loss of the prediction result and the real label by using cross loss entropy.
5. And training the model through back propagation and parameter updating, and performing reasoning test by using a verification set so as to select a better model result.
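Steps 4 and 5 could look roughly like the following sketch, assuming model(frames, boxes) returns per-clip class logits; the dataloaders, the optimizer settings taken from the embodiment, and the model-selection criterion are placeholders.

import torch

def train(model, train_loader, val_loader, epochs=40, lr=1e-4):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, betas=(0.9, 0.999), eps=1e-8)
    criterion = torch.nn.CrossEntropyLoss()
    best_acc, best_state = 0.0, None
    for _ in range(epochs):
        model.train()
        for frames, boxes, labels in train_loader:
            loss = criterion(model(frames, boxes), labels)   # cross-entropy against group-activity labels
            optimizer.zero_grad()
            loss.backward()                                   # back-propagation
            optimizer.step()                                  # parameter update
        model.eval()
        with torch.no_grad():                                 # inference test on the validation set
            hits = sum((model(f, b).argmax(-1) == y).sum().item() for f, b, y in val_loader)
            total = sum(len(y) for _, _, y in val_loader)
        if hits / total > best_acc:                           # keep the better model
            best_acc, best_state = hits / total, model.state_dict()
    return best_state, best_acc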
Example 2
As shown in fig. 1 to fig. 6, as a further optimization of the embodiment 1, this embodiment includes all the technical features of the embodiment 1, and in addition, this embodiment also includes the following technical features:
Firstly, the key frames of a given video clip and the corresponding bounding box of each individual are fed into a backbone network and RoIAlign to obtain the appearance features of the individuals; secondly, the appearance features are fed into the local residual aggregation network module and the graph-neural-network-based relation reasoning module to obtain the corresponding difference features and relation features; then, the appearance features, difference features and relation features are adaptively weighted and fused by the weighted fusion module to obtain the fused features; and finally, a pooling operation is applied to the fused features to obtain a global video representation, which is fed into a classifier to obtain the final classification result.
In the local residual aggregation network module, the appearance features of each individual are fed into the module, the residual between every pair of individuals and the corresponding residual correlation coefficient are calculated, and finally the difference features of each individual are computed under the constraint of the spatial positions.
For the task of group activity recognition, the proposed solution is compared with state-of-the-art methods on two benchmark datasets: the Volleyball dataset and the Collective Activity dataset. Two metrics are used to evaluate model accuracy: MCA (multi-class classification accuracy) and MPCA (mean per-class accuracy).
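For reference, the two metrics could be computed from predictions as in the standard sketch below (a generic formulation, not specific to this patent).

import numpy as np

def mca_mpca(y_true, y_pred, num_classes):
    """MCA: overall multi-class accuracy; MPCA: per-class accuracy averaged over classes."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1                                       # confusion matrix: rows = true, cols = predicted
    mca = np.trace(cm) / cm.sum()
    per_class = np.diag(cm) / cm.sum(axis=1).clip(min=1)    # recall of each class
    mpca = per_class.mean()
    return mca, mpca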
On the Volleyball dataset, the proposed two-branch reasoning scheme formed by the local residual aggregation network module and the graph-neural-network-based relation reasoning module, together with the proposed weighted fusion module, improves MCA by 2.6% over the ARG baseline. Compared with the advanced graph-neural-network-based method DIN, the proposed method improves MCA and MPCA by 0.9% and 1.2% respectively with VGG16 as the backbone.
On the Collective Activity dataset, the proposed method improves MCA by 5.1% with VGG16 as the backbone and ARG as the baseline. Compared with the advanced graph-neural-network-based method DIN, the proposed method improves MPCA by 0.8% and 0.6% with ResNet-18 and VGG16 as backbones, respectively.
In this method, 3 representative frames are taken from each video clip as input; each frame of the Volleyball dataset is cropped to a size of 720 × 1280 and each frame of the Collective Activity dataset to a size of 480 × 720. ResNet-18 or VGG16 is adopted as the backbone network. An adaptive momentum estimation (Adam) optimizer is used to train the model, with β1 = 0.9, β2 = 0.999 and ε = 10^-8. For the Volleyball dataset, the initial learning rate is set to 1e-4 and is updated every 10 epochs with a decay rate of 0.3, and the number of training epochs is 40. For the Collective Activity dataset, the learning rate is set to 4e-5 for ResNet-18 and 1e-4 for VGG16, and the number of training epochs is 30. The spatial limiting factor of the local residual aggregation network module is set to 0.2 and 0.3 of the image width on the Volleyball dataset and the Collective Activity dataset, respectively. The normalization factor d_k of the graph-neural-network-based relation reasoning module is set to 256. The dimension shrinkage factor of the weighted fusion module is set to 16. The batch size for both datasets is set to 2 during training.
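For illustration, the Volleyball-setting hyperparameters listed above could be gathered in a configuration object like the following; the field names are placeholders and only the values are taken from the embodiment.

from dataclasses import dataclass

@dataclass
class VolleyballConfig:
    frames_per_clip: int = 3          # representative frames per video clip
    frame_size: tuple = (720, 1280)   # frame size for the Volleyball dataset
    backbone: str = "vgg16"           # ResNet-18 is the other backbone used in the embodiment
    lr: float = 1e-4                  # initial learning rate, decayed by 0.3 every 10 epochs
    lr_decay: float = 0.3
    lr_step: int = 10
    epochs: int = 40
    mu: float = 0.2                   # spatial restriction factor (fraction of image width)
    d_k: int = 256                    # normalization factor of the graph reasoning branch
    reduction: int = 16               # dimension shrinkage factor of the weighted fusion module
    batch_size: int = 2
    adam_betas: tuple = (0.9, 0.999)
    adam_eps: float = 1e-8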
The accuracy on these two benchmark datasets reflects the advancement of the method. From the analysis, the method has two main advantages: (1) the local residual aggregation network module can encode the potential differences among all related actors in the crowd and provide additional clues for reasoning; (2) the weighted fusion strategy can adaptively select the more important information among the different semantic features.
As described above, the present invention can be preferably realized.
All features disclosed in all embodiments in this specification, or all methods or process steps implicitly disclosed, may be combined and/or expanded, or substituted, in any way, except for mutually exclusive features and/or steps.
The foregoing is only a preferred embodiment of the present invention, and the present invention is not limited thereto in any way, and any simple modification, equivalent replacement and improvement made to the above embodiment within the spirit and principle of the present invention still fall within the protection scope of the present invention.

Claims (8)

1. A group activity identification method based on a residual aggregation graph network is characterized by comprising the following steps:
s1, appearance feature extraction: obtaining the appearance characteristics X of the individual level of the group to be identified by utilizing the key frames of the given video clip and the bounding boxes of each individual,
Figure FDA0003977786120000011
x i ∈R D wherein T represents the frame number of the key frame of the video clip, and N represents the individual number of each key frame in the video clip; i represents the number of individuals in the key frame of the video segment, i =1,2, \8230;, T × N; x is a radical of a fluorine atom i Representing the appearance characteristics of individuals numbered i in the video clip, wherein R represents a linear space in a real number domain, and D represents the dimension of the appearance characteristics of the linear space R;
s2, double-branch reasoning: performing difference relation reasoning on the appearance characteristics based on residual error aggregation to obtain difference characteristics
Figure FDA0003977786120000012
And performing similarity relationship reasoning on the appearance characteristics based on a graph neural network to obtain relationship characteristics X';
s3, weighted fusion: weighting and fusing the appearance characteristics, the difference characteristics and the relation characteristics in the channel direction to obtain weighted characteristics
Figure FDA0003977786120000013
S4, predicting group activities: pooling the weighted features to obtain a global representation representing the whole video segment, further processing to obtain a confidence coefficient of each frame for the group activity category, and predicting the group activity category by using an average value of the confidence coefficients of each category of each frame;
in step S2, the formula for performing residual-aggregation-based difference relation reasoning on the appearance features to obtain the difference features is:

x̂_j = Σ_{i=1}^{T×N} m_{j,i} · r_i(x_j) · (x_j − x_i);

wherein j represents the index of an individual in the video clip, j = 1, 2, ..., T×N; x̂_j represents the difference feature of the j-th individual and is an element of the difference features X̂, X̂ = {x̂_j | j = 1, 2, ..., T×N}; r_i(x_j) represents the residual relationship between individual j and individual i; m_{j,i} represents the spatial position correlation between individual j and individual i; and x_j − x_i represents the difference in appearance features between individual j and individual i at different spatio-temporal positions;
the step S3 comprises the following steps:
S31, adding the appearance features, the difference features and the relation features element-wise to obtain integrated features, calculated as:

F = Σ_{b=1}^{N_b} X^b;

wherein F represents the integrated features, F ∈ R^{T×N×D}; b is the branch index; N_b represents the number of branches, N_b being 3 in the above formula; X^b represents the appearance features, the difference features or the relation features, i.e. the different semantic features X, X̂ and X';
s32, embedding global information in the channel direction by using a global average pooling layer and a full connection layer to generate channel statistical information, wherein the calculation mode is as follows:
Figure FDA0003977786120000033
where s represents channel statistics, W s Representing learnable parameters for linear transformation of pooled features,
Figure FDA0003977786120000034
f (n, t:) represents the characteristics of the nth individual in the t frame in the F in the channel direction;
s33, obtaining the weights of different branch characteristics in the channel direction through the operation of the full connection layer and the softmax, wherein the calculation mode is as follows:
W b =softmax(w b S);
wherein, W b Weight vector of features branched into b, w b For branch b the mathematically linear transformation parameters that map the channel statistics S to weight vectors,
Figure FDA0003977786120000041
s34, calculating each one-dimensional feature in the channel direction of the weighted and fused features in the following way:
Figure FDA0003977786120000042
wherein,
Figure FDA0003977786120000043
representing weighted features
Figure FDA0003977786120000044
The value of the c-th channel of (c),
Figure FDA0003977786120000045
representing the weight of the feature in the c channel dimension of the b-th branch,
Figure FDA0003977786120000046
is W b The value of the c-th element;
Figure FDA0003977786120000047
a value representing a characteristic in the channel dimension of the c-th branch of the b-th branch,
Figure FDA0003977786120000048
is X b The value of the c-th element.
2. The group activity identification method based on the residual aggregation graph network according to claim 1, wherein the calculation formula of r_i(x_j) is:

r_i(x_j) = σ(w_j^T (x_j − x_i) + b_j);

wherein σ(·) represents a sigmoid-type smoothing function, w_j represents a weight that maps the appearance difference between two individuals centred on individual j to a scalar, b_j represents the corresponding scalar offset, w_j ∈ R^D, and b_j ∈ R^1.
3. The group activity identification method based on the residual aggregation graph network according to claim 2, wherein the calculation formula of the spatial position correlation m_{j,i} is:

m_{j,i} = Π(d(x_j, x_i) ≤ μ);

wherein Π(·) represents an indicator function, d(x_j, x_i) represents the Euclidean distance between individual j and individual i, and μ represents the spatial restriction factor.
4. The group activity identification method based on the residual aggregation graph network according to claim 1, wherein in step S2, the formula for performing graph-neural-network-based similarity relation reasoning on the appearance features to obtain the relation features is:

X' = Σ_{g=1}^{N_g} ReLU(G^g X W^g);

wherein g represents the index of a relation graph, N_g represents the number of graphs, ReLU(·) represents the ReLU activation function, G^g represents the relation graph with index g, and W^g represents the weight matrix of the linear transformation corresponding to the relation graph with index g.
5. The group activity identification method based on the residual aggregation graph network according to claim 4, wherein the calculation formula of G^g is:

G^g = {G_{i,j} ∈ R^1 | i, j = 1, ..., T×N},
G_{i,j} = m_{i,j} · exp(f_a(x_i, x_j)) / Σ_{j=1}^{T×N} m_{i,j} · exp(f_a(x_i, x_j));

wherein G_{i,j} represents the magnitude of the similarity relationship between individual i and individual j, m_{i,j} represents the spatial position correlation between individual i and individual j, and f_a(x_i, x_j) represents the appearance correlation between individual i and individual j.
6. The group activity identification method based on the residual aggregation graph network according to claim 5, wherein the calculation formula of f_a(x_i, x_j) is:

f_a(x_i, x_j) = θ(x_i)^T φ(x_j) / √(d_k);

wherein θ(x_i) represents a linear transformation embedding the D-dimensional appearance feature x_i of individual i into a d_k-dimensional space, φ(x_j) represents a linear transformation embedding the D-dimensional appearance feature x_j of individual j into a d_k-dimensional space, ^T represents the transpose operation, and d_k represents a normalization factor.
7. The method of claim 6, wherein in step S4, the global representation is passed through a full link layer and then through softmax operation, so as to obtain the confidence of each frame with respect to the group activity category.
8. A system applying the group activity identification method based on the residual aggregation graph network, characterized by comprising an appearance feature extraction module, a two-branch reasoning module, a weighted fusion module and a group activity prediction module which are electrically connected in sequence, the appearance feature extraction module also being electrically connected with the weighted fusion module;
wherein:
Appearance feature extraction module: used for obtaining the individual-level appearance features X of the group to be identified from the key frames of a given video clip and the bounding box of each individual, X = {x_i | i = 1, 2, ..., T×N}, x_i ∈ R^D, wherein T represents the number of key frames of the video clip, N represents the number of individuals in each key frame of the video clip, i represents the index of an individual in the video clip key frames, i = 1, 2, ..., T×N, x_i represents the appearance feature of the individual numbered i in the video clip, R represents a linear space over the real number field, and D represents the dimension of the appearance feature in the linear space R;
Two-branch reasoning module: used for performing residual-aggregation-based difference relation reasoning on the appearance features to obtain difference features X̂, and performing graph-neural-network-based similarity relation reasoning on the appearance features to obtain relation features X';
Weighted fusion module: used for performing weighted fusion of the appearance features, the difference features and the relation features in the channel direction to obtain weighted features X̃;
A group activity prediction module: performing pooling operation on the weighted features to obtain a global representation representing the whole video segment, further processing to obtain confidence of each frame to the group activity category, and predicting the group activity category by using an average value of the confidence of each category of each frame;
when the two-branch reasoning module operates, the formula for performing residual-aggregation-based difference relation reasoning on the appearance features to obtain the difference features is:

x̂_j = Σ_{i=1}^{T×N} m_{j,i} · r_i(x_j) · (x_j − x_i);

wherein j represents the index of an individual in the video clip, j = 1, 2, ..., T×N; x̂_j represents the difference feature of the j-th individual and is an element of the difference features X̂, X̂ = {x̂_j | j = 1, 2, ..., T×N}; r_i(x_j) represents the residual relationship between individual j and individual i; m_{j,i} represents the spatial position correlation between individual j and individual i; and x_j − x_i represents the difference in appearance features between individual j and individual i at different spatio-temporal positions;
the weighted fusion module performs the following steps:
S31, adding the appearance features, the difference features and the relation features element-wise to obtain integrated features, calculated as:

F = Σ_{b=1}^{N_b} X^b;

wherein F represents the integrated features, F ∈ R^{T×N×D}; b is the branch index; N_b represents the number of branches, N_b being 3 in the above formula; X^b represents the appearance features, the difference features or the relation features, i.e. the different semantic features X, X̂ and X';
s32, embedding global information in the channel direction by using a global average pooling layer and a full connection layer to generate channel statistical information, wherein the calculation mode is as follows:
Figure FDA0003977786120000095
wherein S represents channel statistics, W s Representing learnable parameters for linear transformation of pooled features,
Figure FDA0003977786120000101
f (n, t:) represents the characteristics of the nth individual in the t frame in the F in the channel direction;
s33, obtaining the weights of different branch characteristics in the channel direction through the operation of the full connection layer and the softmax, wherein the calculation mode is as follows:
W b =softmax(W b S);
wherein, W b Weight vector of features branched into b, w b For branch b the mathematically linear transformation parameters that map the channel statistics S to weight vectors,
Figure FDA0003977786120000102
s34, calculating each one-dimensional feature in the channel direction of the weighted and fused features in the following way:
Figure FDA0003977786120000103
wherein,
Figure FDA0003977786120000111
representing weighted features
Figure FDA0003977786120000112
The value of the c-th channel of (c),
Figure FDA0003977786120000113
representing the weight of the feature in the c channel dimension of the b-th branch,
Figure FDA0003977786120000114
is W b The value of the c-th element;
Figure FDA0003977786120000115
a value representing a characteristic in the channel dimension of the c-th branch of the b-th branch,
Figure FDA0003977786120000116
is X b The value of the c-th element.
CN202210236706.2A 2022-03-10 2022-03-10 Group activity identification method and system based on residual aggregation graph network Active CN114863356B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210236706.2A CN114863356B (en) 2022-03-10 2022-03-10 Group activity identification method and system based on residual aggregation graph network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210236706.2A CN114863356B (en) 2022-03-10 2022-03-10 Group activity identification method and system based on residual aggregation graph network

Publications (2)

Publication Number Publication Date
CN114863356A CN114863356A (en) 2022-08-05
CN114863356B true CN114863356B (en) 2023-02-03

Family

ID=82627853

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210236706.2A Active CN114863356B (en) 2022-03-10 2022-03-10 Group activity identification method and system based on residual aggregation graph network

Country Status (1)

Country Link
CN (1) CN114863356B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111401174A (en) * 2020-03-07 2020-07-10 北京工业大学 Volleyball group behavior identification method based on multi-mode information fusion
WO2021032295A1 (en) * 2019-08-21 2021-02-25 Toyota Motor Europe System and method for detecting person activity in video
CN112613349A (en) * 2020-12-04 2021-04-06 北京理工大学 Time sequence action detection method and device based on deep hybrid convolutional neural network
CN112699786A (en) * 2020-12-29 2021-04-23 华南理工大学 Video behavior identification method and system based on space enhancement module
CN112818843A (en) * 2021-01-29 2021-05-18 山东大学 Video behavior identification method and system based on channel attention guide time modeling
CN112926396A (en) * 2021-01-28 2021-06-08 杭州电子科技大学 Action identification method based on double-current convolution attention
CN113435430A (en) * 2021-08-27 2021-09-24 中国科学院自动化研究所 Video behavior identification method, system and equipment based on self-adaptive space-time entanglement
CN113673489A (en) * 2021-10-21 2021-11-19 之江实验室 Video group behavior identification method based on cascade Transformer

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11210523B2 (en) * 2020-02-06 2021-12-28 Mitsubishi Electric Research Laboratories, Inc. Scene-aware video dialog
CN112434608B (en) * 2020-11-24 2023-02-28 山东大学 Human behavior identification method and system based on double-current combined network

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021032295A1 (en) * 2019-08-21 2021-02-25 Toyota Motor Europe System and method for detecting person activity in video
CN111401174A (en) * 2020-03-07 2020-07-10 北京工业大学 Volleyball group behavior identification method based on multi-mode information fusion
CN112613349A (en) * 2020-12-04 2021-04-06 北京理工大学 Time sequence action detection method and device based on deep hybrid convolutional neural network
CN112699786A (en) * 2020-12-29 2021-04-23 华南理工大学 Video behavior identification method and system based on space enhancement module
CN112926396A (en) * 2021-01-28 2021-06-08 杭州电子科技大学 Action identification method based on double-current convolution attention
CN112818843A (en) * 2021-01-29 2021-05-18 山东大学 Video behavior identification method and system based on channel attention guide time modeling
CN113435430A (en) * 2021-08-27 2021-09-24 中国科学院自动化研究所 Video behavior identification method, system and equipment based on self-adaptive space-time entanglement
CN113673489A (en) * 2021-10-21 2021-11-19 之江实验室 Video group behavior identification method based on cascade Transformer

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Coherence Constrained Graph LSTM; Jinhui Tang et al.; IEEE; 2019-07-15; 636-647 *
Research on group behavior recognition algorithms based on surveillance video (基于监控视频的群体行为识别算法研究); 张乐军; 《中国优秀硕士学位论文全文数据库信息科技辑》; 2019-07-15; full text *
Research on action recognition based on behavior feature optimization and deep learning (基于行为特征优化和深度学习的行为识别研究); 熊辛; 《中国博士学位论文全文数据库信息科技辑》; 2022-02-15; pages 45-87 *

Also Published As

Publication number Publication date
CN114863356A (en) 2022-08-05

Similar Documents

Publication Publication Date Title
CN110084151B (en) Video abnormal behavior discrimination method based on non-local network deep learning
CN109902564B (en) Abnormal event detection method based on structural similarity sparse self-coding network
CN109767312B (en) Credit evaluation model training and evaluation method and device
CN111737592B (en) Recommendation method based on heterogeneous propagation collaborative knowledge sensing network
CN109522961B (en) Semi-supervised image classification method based on dictionary deep learning
CN109344285A (en) A kind of video map construction and method for digging, equipment towards monitoring
CN114692741B (en) Generalized face counterfeiting detection method based on domain invariant features
CN111292195A (en) Risk account identification method and device
CN112765370B (en) Entity alignment method and device of knowledge graph, computer equipment and storage medium
Chi et al. A decision support system for detecting serial crimes
CN112288034A (en) Semi-supervised online anomaly detection method for wireless sensor network
CN111598032B (en) Group behavior recognition method based on graph neural network
CN117591813B (en) Complex equipment fault diagnosis method and system based on multidimensional features
CN118171171A (en) Heat supply pipe network fault diagnosis method based on graphic neural network and multidimensional time sequence data
CN113343123A (en) Training method and detection method for generating confrontation multiple relation graph network
CN114863356B (en) Group activity identification method and system based on residual aggregation graph network
CN115761654B (en) Vehicle re-identification method
CN116304906A (en) Trusted graph neural network node classification method
CN116257786A (en) Asynchronous time sequence classification method based on multi-element time sequence diagram structure
CN113312968A (en) Real anomaly detection method in surveillance video
CN112927248A (en) Point cloud segmentation method based on local feature enhancement and conditional random field
CN117828280B (en) Intelligent fire information acquisition and management method based on Internet of things
Bai et al. Neural ordinary differential equation model for evolutionary subspace clustering and its applications
CN118133190B (en) Load identification model construction method and load identification method based on BN relation network
CN116543416A (en) Unsupervised pedestrian re-identification method integrating relation features and content features

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant