CN114863356A - Group activity identification method and system based on residual aggregation graph network - Google Patents

Group activity identification method and system based on residual aggregation graph network

Info

Publication number
CN114863356A
CN114863356A
Authority
CN
China
Prior art keywords
individual
representing
appearance
group activity
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210236706.2A
Other languages
Chinese (zh)
Other versions
CN114863356B (en)
Inventor
李威
吴晓
杨添朝
张基
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southwest Jiaotong University
Original Assignee
Southwest Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southwest Jiaotong University filed Critical Southwest Jiaotong University
Priority to CN202210236706.2A priority Critical patent/CN114863356B/en
Publication of CN114863356A publication Critical patent/CN114863356A/en
Application granted granted Critical
Publication of CN114863356B publication Critical patent/CN114863356B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/24 - Classification techniques
    • G06F 18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 - Classification techniques relating to the classification model based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F 18/25 - Fusion techniques
    • G06F 18/253 - Fusion techniques of extracted features
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G06N 3/047 - Probabilistic or stochastic networks
    • G06N 3/048 - Activation functions
    • G06N 3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the technical field of data identification, in particular to the technical field of intelligent video image analysis, and discloses a group activity identification method and system based on a residual aggregation graph network. The method comprises the following steps: S1, extracting appearance features; S2, performing two-branch reasoning; S3, weighted fusion; and S4, predicting the group activity. The invention solves problems of the prior art, such as the difficulty of effectively distinguishing video clips with similar individual actions but different group activity categories and the lack of importance screening of different semantic features.

Description

Group activity identification method and system based on residual aggregation graph network
Technical Field
The invention relates to the technical field of data identification, in particular to the technical field of intelligent video image analysis, and specifically relates to a group activity identification method and system based on a residual aggregation graph network.
Background
With the growing urban population and the dramatic increase in the flow of people in public areas, crowd monitoring and management face great challenges and pressures. If video surveillance technology can be used to detect and raise alarms for abnormal group behavior in important areas in time, the relevant departments can take corresponding measures against the early warnings or alarms in the shortest possible time, minimizing both the likelihood of safety accidents and the losses such accidents cause. Therefore, more and more video surveillance systems are deployed in public places to maintain public order and improve security in public areas, and group activity analysis is receiving more and more attention.
The difficulty of group activity detection is that, in addition to the actions of individuals, the potential connections between individuals in a group must be captured. Therefore, to better recognize group activities, it is critical to utilize various kinds of information, such as appearance information, spatial location information, similarity relationship information, and difference information.
At present, group activity recognition is mostly addressed by methods based on graph neural networks. Such methods mainly comprise the following steps:
Firstly, extracting the appearance features of each individual in several representative frames of the corresponding video clip through a backbone network.
Secondly, capturing the correlations among individuals in the group in the form of a graph, and extracting relation features by graph convolution.
Thirdly, performing simple element-wise addition fusion and a pooling operation on the individual appearance features and the relation features to obtain a video feature representing the whole video clip.
Fourthly, feeding the video feature into a classifier to obtain the corresponding group activity classification result.
Such graph-neural-network-based methods, firstly, ignore the difference information that exists between the actors in a video (such as the slight differences between similar actions), which is very important for effectively distinguishing video clips with similar individual actions but different group activity categories; secondly, they fuse the low-order appearance features and the high-order reasoning features with equal weights, a fusion scheme that lacks importance screening of the different semantic features.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a group activity identification method and system based on a residual aggregation graph network, which solve the problems in the prior art that video clips with similar individual actions but different group activity categories are difficult to distinguish effectively and that importance screening of different semantic features is lacking.
The technical scheme adopted by the invention to solve the above problems is as follows:
A group activity identification method based on a residual aggregation graph network comprises the following steps:
S1, extracting appearance features: obtaining the individual-level appearance features X of the group to be identified by utilizing the key frames of a given video clip and the corresponding bounding box of each individual, X = {x_i | i = 1, 2, ..., T×N}, x_i ∈ R^D, wherein T represents the number of key frames of the video clip, N represents the number of individuals in each key frame, i represents the number of an individual in the video clip key frames, x_i represents the appearance feature of the individual numbered i in the video clip, R represents a linear space over the real number field, and D represents the dimension of the appearance features in the linear space R;
S2, two-branch reasoning: performing difference relation reasoning based on residual aggregation on the appearance features to obtain the difference features X̂, and performing similarity relation reasoning based on a graph neural network on the appearance features to obtain the relation features X';
S3, weighted fusion: performing weighted fusion of the appearance features, the difference features and the relation features in the channel direction to obtain the weighted features X̃;
S4, group activity prediction: performing a pooling operation on the weighted features to obtain a global representation of the whole video clip, further processing it to obtain the confidence of each frame for each group activity category, and predicting the group activity category using the average of the per-category confidences over all frames.
As a preferable technical solution, in step S2, the formula for performing difference relation reasoning based on residual aggregation on the appearance features to obtain the difference features is as follows:
x̂_j = Σ_{i=1}^{T×N} r_i(x_j) · m_{j,i} · (x_j − x_i),
where j denotes the number of an individual in the video clip, j = 1, 2, ..., T×N; x̂_j represents the difference feature of the j-th individual and is an element of the difference features X̂ = {x̂_j | j = 1, 2, ..., T×N}; r_i(x_j) represents the residual relationship between individual j and individual i; m_{j,i} represents the spatial position correlation between individual j and individual i; and x_j − x_i represents the difference in appearance features between individual j and individual i (possibly located in different frames).
As a preferred embodiment, the calculation formula of r_i(x_j) is:
r_i(x_j) = σ(w_j^T (x_j − x_i) + b_j),
where σ(·) denotes the sigmoid function, w_j represents the weight that maps the appearance difference between two individuals centered on individual j to a scalar, b_j represents the corresponding scalar offset, w_j ∈ R^D, b_j ∈ R^1.
As a preferred technical solution, the calculation formula of the spatial position correlation m_{j,i} is:
m_{j,i} = Π(d_{j,i} ≤ μ),
where Π(·) represents the indicator function, d_{j,i} denotes the Euclidean distance between individual j and individual i, and μ denotes the spatial limiting factor.
As a preferable technical solution, in step S2, the formula for performing similarity relation reasoning based on a graph neural network on the appearance features to obtain the relation features is:
X' = Σ_{g=1}^{N_g} ReLU(G_g X W_g),
where g represents the number of a relation graph, N_g represents the number of graphs, ReLU(·) represents the ReLU activation function, G_g represents the relation graph numbered g, and W_g represents the weight matrix of the linear transformation corresponding to the relation graph numbered g.
As a preferred embodiment, the calculation formula of G_g is:
G_g = {G_{i,j} ∈ R^1 | i, j = 1, ..., T×N},
G_{i,j} = Π(d_{i,j} ≤ μ) · exp(f_a(x_i, x_j)) / Σ_{j=1}^{T×N} ( Π(d_{i,j} ≤ μ) · exp(f_a(x_i, x_j)) ),
where G_{i,j} represents the magnitude of the similarity relationship between individual i and individual j, and f_a(x_i, x_j) represents the appearance correlation between individual i and individual j.
As a preferred solution, the calculation formula of f_a(x_i, x_j) is:
f_a(x_i, x_j) = θ(x_i) · Transpose(φ(x_j)) / √d_k,
where θ(x_i) represents the linear transformation that embeds the D-dimensional appearance feature x_i of individual i into a d_k-dimensional space, φ(x_j) represents the linear transformation that embeds the D-dimensional appearance feature x_j of individual j into a d_k-dimensional space, Transpose(·) represents the transpose operation, and d_k denotes the normalization factor.
As a preferable technical solution, step S3 comprises the following steps:
S31, adding the appearance features, the difference features and the relation features element-wise to obtain the integrated features, calculated as:
F = Σ_{b=1}^{N_b} X_b,
where F represents the integrated features, F ∈ R^{T×N×D}; b is the branch number; N_b represents the number of branches, which in the above formula is 3; X_b represents the appearance features, the difference features or the relation features, i.e. X_1, X_2 and X_3 denote the different semantic features X, X̂ and X', respectively;
S32, embedding global information in the channel direction by using global average pooling and a fully connected layer to generate the channel statistics, calculated as:
S = W_s · ( (1/(T×N)) Σ_{t=1}^{T} Σ_{n=1}^{N} F(n, t, :) ),
where S represents the channel statistics, W_s represents the learnable parameters that linearly transform the pooled features, and F(n, t, :) represents the feature of the n-th individual in the t-th frame of F along the channel direction;
S33, obtaining the weights of the different branch features in the channel direction through a fully connected layer and a softmax operation, calculated as:
W_b = softmax(w_b S),
where W_b is the weight vector of the features of branch b, and w_b is the learnable linear transformation parameter of branch b that maps the channel statistics S to the weight vector;
S34, calculating each one-dimensional feature in the channel direction of the weighted fused features as:
X̃^c = Σ_{b=1}^{N_b} W_b^c · X_b^c,
where X̃^c represents the value of the c-th channel of the weighted features X̃; W_b^c represents the weight of the b-th branch feature in the c-th channel dimension, i.e. the value of the c-th element of W_b; and X_b^c represents the value of the b-th branch feature in the c-th channel dimension, i.e. the value of the c-th element of X_b.
As a preferred technical solution, in step S4, the global representation is passed through a full connection layer and then through softmax operation, so as to obtain the confidence of each frame with respect to the group activity category.
The system applied to the group activity identification method based on a residual aggregation graph network comprises an appearance feature extraction module, a two-branch reasoning module, a weighted fusion module and a group activity prediction module which are electrically connected in sequence, wherein the appearance feature extraction module is also electrically connected with the weighted fusion module;
wherein:
appearance feature extraction module: used for obtaining the individual-level appearance features X of the group to be identified by utilizing the key frames of a given video clip and the bounding box of each individual, X = {x_i | i = 1, 2, ..., T×N}, x_i ∈ R^D, wherein T represents the number of key frames of the video clip, N represents the number of individuals in each key frame, i represents the number of an individual in the video clip key frames, x_i represents the appearance feature of the individual numbered i in the video clip, R represents a linear space over the real number field, and D represents the dimension of the appearance features in the linear space R;
two-branch reasoning module: used for performing difference relation reasoning based on residual aggregation on the appearance features to obtain the difference features X̂, and performing similarity relation reasoning based on a graph neural network on the appearance features to obtain the relation features X';
weighted fusion module: used for performing weighted fusion of the appearance features, the difference features and the relation features in the channel direction to obtain the weighted features X̃;
group activity prediction module: used for performing a pooling operation on the weighted features to obtain a global representation of the whole video clip, further processing it to obtain the confidence of each frame for each group activity category, and predicting the group activity category using the average of the per-category confidences over all frames.
Compared with the prior art, the invention has the following beneficial effects:
(1) the method utilizes the difference information among the actors in the video, which is very important for effectively distinguishing the video segments which have similar individual actions but different group activity categories; by means of a method capable of capturing potential useful difference information in a group and a self-adaptive method for weighting and fusing different semantic features, the group activity detection precision is greatly improved;
(2) the local residual error aggregation network module provided by the invention can encode potential differences among all related actors in a crowd and provide additional clues for reasoning;
(3) the weighted fusion strategy provided by the invention can adaptively select more important information in different semantic features.
Drawings
FIG. 1 is a schematic diagram illustrating steps of a group activity recognition method based on a residual aggregation graph network according to the present invention;
FIG. 2 is a block diagram of a group activity recognition system based on a residual aggregation graph network according to the present invention;
FIG. 3 is a schematic view of a workflow of a group activity recognition system based on a residual aggregation graph network according to the present invention;
FIG. 4 is a schematic structural diagram of a local residual aggregation network module according to the present invention;
FIG. 5 is a schematic structural diagram of a weighted fusion module according to the present invention;
FIG. 6 is a flow chart of model training according to the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited to these examples.
Example 1
As shown in fig. 1 to 6, the present invention provides a group activity recognition method and a model training method, which can detect an action category of a group from a video.
Detecting the actions of an individual in a video requires extracting the individual's appearance information and spatio-temporal information. In a real scene, each individual has its own motion behavior, so group activity recognition is performed by extracting the appearance and spatio-temporal information of each individual and modeling the latent information among individuals so as to infer the potential relationships within the group.
Most group activity recognition methods based on graph neural networks, firstly, ignore the difference information that exists between the actors in a video (such as the slight differences between similar actions); secondly, they fuse the low-order appearance features and the high-order relation features with equal weights, a fusion scheme that lacks importance screening of the different semantic features. Finding a way to capture the potentially useful difference information within a group, together with a method for adaptively weighting and fusing different semantic features, is therefore very important for improving the accuracy of group activity detection.
In order to solve the above problems, the present invention provides a group activity identification method and system based on a residual aggregation graph network. The local residual aggregation network module is used to capture the potentially useful difference information within a group, and can be combined with the graph-neural-network-based relation reasoning module to form two-branch reasoning. The weighted fusion module is used to adaptively weight and fuse different semantic features so as to screen out the important information.
The main technical solution of the present invention will be described in detail with reference to specific examples.
A group activity detection method based on a graph neural network and a residual error aggregation network comprises the following steps:
1. Extracting basic appearance features:
The key frames of a given video clip and the corresponding bounding box of each individual are processed by a backbone network and RoIAlign to obtain individual-level appearance features, expressed as X = {x_i | i = 1, 2, ..., T×N}, where x_i represents the appearance feature of each individual and x_i ∈ R^D.
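For illustration only, the following PyTorch-style sketch shows one way this appearance-feature extraction step could be realized with a ResNet backbone and torchvision's RoIAlign; the backbone choice, crop size, feature dimension and box values are placeholder assumptions, not the exact configuration of the invention.

```python
import torch
import torch.nn as nn
import torchvision
from torchvision.ops import roi_align

# Hypothetical sizes: T key frames, N individuals per frame, D-dim appearance features.
T, N, D = 3, 12, 1024

backbone = torchvision.models.resnet18(weights=None)
feature_extractor = nn.Sequential(*list(backbone.children())[:-2])  # keep convolutional stages only

frames = torch.randn(T, 3, 720, 1280)  # T key frames of one video clip
# One (x1, y1, x2, y2) bounding box per individual and frame (placeholder values).
boxes = [torch.tensor([[10.0, 10.0, 110.0, 210.0]]).repeat(N, 1) for _ in range(T)]

feat_maps = feature_extractor(frames)                       # (T, C, H/32, W/32) feature maps
pooled = roi_align(feat_maps, boxes, output_size=(5, 5),
                   spatial_scale=feat_maps.shape[-1] / frames.shape[-1])
proj = nn.Linear(pooled.shape[1] * 5 * 5, D)                # project each pooled crop to D dims
X = proj(pooled.flatten(1))                                  # (T*N, D) individual appearance features x_i
print(X.shape)                                               # torch.Size([36, 1024])
```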
2. The two-branch reasoning network module:
The obtained appearance features X are input into the local residual aggregation network module and the graph-neural-network-based relation reasoning module, respectively, to carry out two-branch high-order reasoning. These two networks are described separately below.
Local residual aggregation network module (LR²M):
The obtained appearance features are input into LR²M, and the difference information among individuals in the group is modeled to obtain the difference features X̂ = {x̂_j | j = 1, 2, ..., T×N}, where x̂_j is an element of the difference features X̂ and represents the difference feature of the j-th individual.
x̂_j is calculated as follows:
x̂_j = Σ_{i=1}^{T×N} r_i(x_j) · m_{j,i} · (x_j − x_i),
wherein x_j − x_i represents the appearance difference between the j-th individual and the i-th individual; r_i(x_j) represents the residual relationship between individual j and individual i, taking the value 1 when a useful difference relationship exists between (j, i) and 0 when no useful difference relationship exists; and m_{j,i} is a spatial limitation.
To make the above equation differentiable so that it can be back-propagated during network training, r_i(x_j) is smoothed. The smoothed r_i(x_j) is calculated as follows:
r_i(x_j) = σ(w_j^T (x_j − x_i) + b_j),
where σ(·) denotes the sigmoid function, w_j represents the weight that maps the appearance difference between two individuals centered on individual j to a scalar, b_j represents the corresponding scalar offset, w_j ∈ R^D, b_j ∈ R^1.
Previous research has shown that local information is more favorable for reasoning about group activity categories. Therefore, the local residual relations are aggregated using a distance mask, calculated as follows:
m_{j,i} = Π(d_{j,i} ≤ μ),
where Π(·) is the indicator function, μ is the spatial limiting factor (a hyper-parameter whose specific value is chosen according to the situation), and d_{j,i} is the Euclidean distance between individual j and individual i.
That is, m_{j,i} = 1 when d_{j,i} ≤ μ, and m_{j,i} = 0 otherwise.
Finally, x̂_j is calculated as follows:
x̂_j = Σ_{i=1}^{T×N} r_i(x_j) · m_{j,i} · (x_j − x_i).
group activities are inferred by modeling local differences between the activities in the group as described above to obtain useful difference information in the video segments.
A relation reasoning module based on the neural network of the graph:
for relational modeling, a graph neural network is adopted to establish an actor-relational graph, and relational information between actors can be provided for group activity recognition by utilizing a graph structure.
Each node in the actor relation graph represents an individual, and the importance of the relationship between two individuals is represented by the weight of the edge between the two actors. The weight of each edge in the graph is determined by the appearance features and the spatial positions of the individuals at its two ends. The weight of the edge between two individual nodes is calculated as:
G_{i,j} = Π(d_{i,j} ≤ μ) · exp(f_a(x_i, x_j)) / Σ_{j=1}^{T×N} ( Π(d_{i,j} ≤ μ) · exp(f_a(x_i, x_j)) ),
where G_{i,j} represents the magnitude of the similarity relationship between individual i and individual j, and f_a(x_i, x_j) represents the appearance correlation between individual i and individual j.
For the appearance correlation, an embedded dot-product method is adopted, with the corresponding formula:
f_a(x_i, x_j) = θ(x_i) · Transpose(φ(x_j)) / √d_k,
where θ(x_i) represents the linear transformation that embeds the D-dimensional appearance feature x_i of individual i into a d_k-dimensional space, φ(x_j) represents the linear transformation that embeds the D-dimensional appearance feature x_j of individual j into a d_k-dimensional space, Transpose(·) represents the transpose operation, and d_k denotes the normalization factor, which is a constant.
For the spatial position correlation, the same distance-masking method as in the local residual aggregation network is adopted:
Π(d_{i,j} ≤ μ),
where Π(·) represents the indicator function, d_{i,j} denotes the Euclidean distance between individual i and individual j, and μ denotes the spatial limiting factor.
Thus, a relation graph between the actors in the group can be expressed as:
G_g = {G_{i,j} ∈ R^1 | i, j = 1, ..., T×N}.
Meanwhile, multiple relation graphs are constructed to capture different kinds of related information. In the present invention, a series of graphs {G_g | g = 1, ..., N_g} is built; each graph is computed separately and the graphs do not share weights. Building multiple graphs allows the model to merge and learn different types of relationship information, so that it can perform more reliable relational reasoning.
After graph construction, single-layer relational reasoning is implemented with a GCN. The relation features are calculated as follows:
X' = Σ_{g=1}^{N_g} ReLU(G_g X W_g),
where g represents the number of a relation graph, N_g represents the number of graphs, ReLU(·) represents the ReLU activation function, G_g represents the relation graph numbered g, and W_g represents the weight matrix of the linear transformation corresponding to the relation graph numbered g.
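A rough sketch of this graph branch is given below. It assumes N_g independent graphs, embedded dot-product affinities that are spatially masked and normalized by a softmax (following the ARG-style formulation that serves as the baseline), and a single graph-convolution layer per graph; the layer sizes and names are assumptions for illustration.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphRelationReasoning(nn.Module):
    """Sketch of the graph-based similarity branch: N_g actor relation graphs
    built from embedded dot-product affinities with a distance mask and a
    softmax normalization, followed by one GCN layer per graph."""

    def __init__(self, dim: int, d_k: int = 256, num_graphs: int = 4, mu: float = 0.2):
        super().__init__()
        self.theta = nn.ModuleList([nn.Linear(dim, d_k) for _ in range(num_graphs)])
        self.phi = nn.ModuleList([nn.Linear(dim, d_k) for _ in range(num_graphs)])
        self.gcn_w = nn.ModuleList([nn.Linear(dim, dim, bias=False) for _ in range(num_graphs)])
        self.d_k, self.mu = d_k, mu

    def forward(self, x: torch.Tensor, centers: torch.Tensor) -> torch.Tensor:
        # x: (K, D) appearance features, centers: (K, 2) normalized positions
        mask = torch.cdist(centers, centers) <= self.mu            # spatial position correlation
        x_rel = torch.zeros_like(x)
        for theta, phi, w in zip(self.theta, self.phi, self.gcn_w):
            f_a = theta(x) @ phi(x).t() / math.sqrt(self.d_k)      # embedded dot product
            f_a = f_a.masked_fill(~mask, float("-inf"))            # keep only local edges
            G = F.softmax(f_a, dim=-1)                             # relation graph G_g
            x_rel = x_rel + F.relu(G @ w(x))                       # ReLU(G_g X W_g), summed over graphs
        return x_rel

x = torch.randn(36, 1024)
centers = torch.rand(36, 2)
x_rel = GraphRelationReasoning(1024)(x, centers)
print(x_rel.shape)              # torch.Size([36, 1024])
```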
The difference features X̂ and the relation features X' are obtained through the local residual aggregation network module and the graph-neural-network-based relation reasoning module, respectively, forming the two-branch reasoning network.
3. Weighted fusion module (WAS):
The appearance features X, the difference features X̂ and the relation features X' are input into the adaptive weighted fusion module, and the three different semantic features are weighted and fused in the channel direction. The specific method is as follows:
Firstly, the information of all branch features is integrated by simple element-wise addition to obtain the integrated features, calculated as:
F = Σ_{b=1}^{N_b} X_b,
where F represents the integrated features, F ∈ R^{T×N×D}; b is the branch number; N_b represents the number of branches, which in the above formula is 3; X_b represents the appearance features, the difference features or the relation features, i.e. X_1, X_2 and X_3 denote the different semantic features X, X̂ and X', respectively.
Secondly, global information is embedded in the channel direction by a simple global average pooling layer and a fully connected layer to generate the channel statistics, calculated as:
S = W_s · ( (1/(T×N)) Σ_{t=1}^{T} Σ_{n=1}^{N} F(n, t, :) ),
where S represents the channel statistics, W_s represents the learnable parameters that linearly transform the pooled features, and F(n, t, :) represents the feature of the n-th individual in the t-th frame of F along the channel direction.
Thirdly, the weights of the different branch features in the channel direction are obtained through a simple fully connected layer and a softmax operation, calculated as:
W_b = softmax(w_b S),
where W_b is the weight vector of the features of branch b, and w_b is the learnable linear transformation parameter of branch b that maps the channel statistics S to the weight vector.
Fourthly, each one-dimensional feature in the channel direction of the weighted fused features is finally calculated as:
X̃^c = Σ_{b=1}^{N_b} W_b^c · X_b^c,
where X̃^c represents the value of the c-th channel of the weighted features X̃; W_b^c represents the weight of the b-th branch feature in the c-th channel dimension, i.e. the value of the c-th element of W_b; and X_b^c represents the value of the b-th branch feature in the c-th channel dimension, i.e. the value of the c-th element of X_b.
Through the WAS, the features under different semantics are subjected to self-adaptive weighted fusion, and information more important for group activity recognition is screened out.
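The following sketch illustrates one possible realization of the WAS module described above (element-wise summation, global average pooling, a squeeze fully connected layer, per-branch fully connected layers and a softmax across branches); the reduction ratio and layer shapes are assumptions for illustration, not the exact patented configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedAdaptiveFusion(nn.Module):
    """Sketch of the WAS module: per-channel weights over the three semantic
    branches (appearance, difference, relation). Layer sizes are illustrative."""

    def __init__(self, dim: int, num_branches: int = 3, reduction: int = 16):
        super().__init__()
        self.squeeze = nn.Linear(dim, dim // reduction)     # W_s: produces channel statistics
        self.branch_fc = nn.ModuleList(
            [nn.Linear(dim // reduction, dim) for _ in range(num_branches)])  # w_b per branch

    def forward(self, branches):
        # branches: list of (K, D) tensors for the K = T*N individuals
        fused = torch.stack(branches, dim=0).sum(dim=0)     # F = sum_b X_b (element-wise)
        stats = self.squeeze(fused.mean(dim=0))             # global average pooling + FC -> S
        logits = torch.stack([fc(stats) for fc in self.branch_fc], dim=0)   # (B, D)
        weights = F.softmax(logits, dim=0)                   # per-channel branch weights W_b
        out = sum(w * xb for w, xb in zip(weights, branches))  # weighted fusion per channel
        return out

x, x_hat, x_rel = (torch.randn(36, 1024) for _ in range(3))
fused = WeightedAdaptiveFusion(1024)([x, x_hat, x_rel])
print(fused.shape)              # torch.Size([36, 1024])
```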
4. A pooling operation is performed on the weighted features X̃ obtained above, resulting in a global representation of the entire video clip. The global representation is processed by a simple fully connected layer and then a softmax operation to obtain, for each frame, the confidence of each group activity category. The group activity category is predicted using the average of the per-category confidences over all frames.
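A toy sketch of this prediction step is shown below; frame-wise max-pooling over individuals is an assumed choice of pooling operation, and the number of activity categories is a placeholder.

```python
import torch
import torch.nn as nn

T, N, D, num_classes = 3, 12, 1024, 8                    # hypothetical sizes; 8 classes is a placeholder
classifier = nn.Linear(D, num_classes)

fused = torch.randn(T * N, D)                            # weighted features from the WAS module
frame_repr = fused.reshape(T, N, D).max(dim=1).values    # pool the individuals within each frame
conf = torch.softmax(classifier(frame_repr), dim=-1)     # per-frame confidences over activity classes
pred_class = conf.mean(dim=0).argmax().item()            # average over frames, then predict the category
print(pred_class)
```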
Secondly, the model training method for group activity detection based on the graph neural network and the residual aggregation network is as follows:
1. Acquire video clip samples and the label corresponding to each sample, where the label indicates the group activity of the key frames in the training sample.
2. Divide the samples and their labels into two parts according to a proportion: one part is the training set, used to train the model; the other part is the validation set, used for model selection.
3. Pre-train the backbone network with the processed training set.
4. Process the samples in the training set, output prediction results through the model, and compute the loss between the prediction results and the ground-truth labels using cross-entropy.
5. Train the model through back-propagation and parameter updating, and run inference tests on the validation set so as to select the better model.
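The training procedure above could be sketched as follows; the data loaders, model interface and hyper-parameters are placeholders rather than the exact training setup of the invention.

```python
import torch
import torch.nn as nn

def train(model, train_loader, val_loader, epochs=40, lr=1e-4):
    """Sketch of the described procedure: cross-entropy loss, back-propagation
    with Adam, and validation-based model selection."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, betas=(0.9, 0.999), eps=1e-8)
    criterion = nn.CrossEntropyLoss()
    best_acc, best_state = 0.0, None
    for epoch in range(epochs):
        model.train()
        for frames, boxes, labels in train_loader:
            optimizer.zero_grad()
            logits = model(frames, boxes)          # predicted group-activity scores
            loss = criterion(logits, labels)       # cross-entropy against the ground-truth labels
            loss.backward()                        # back-propagation
            optimizer.step()                       # parameter update
        model.eval()
        correct, total = 0, 0
        with torch.no_grad():
            for frames, boxes, labels in val_loader:   # validation set for model selection
                pred = model(frames, boxes).argmax(dim=-1)
                correct += (pred == labels).sum().item()
                total += labels.numel()
        acc = correct / max(total, 1)
        if acc > best_acc:                         # keep the better-performing model
            best_acc = acc
            best_state = {k: v.detach().clone() for k, v in model.state_dict().items()}
    return best_state, best_acc
```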
Example 2
As shown in fig. 1 to fig. 6, as a further optimization of Embodiment 1, this embodiment includes all the technical features of Embodiment 1 and, in addition, the following technical features:
Firstly, the key frames of a given video clip and the corresponding bounding box of each individual are input into the backbone network, and individual appearance features are obtained using the backbone network and RoIAlign; secondly, the appearance features are input into the local residual aggregation network module and the graph-neural-network-based relation reasoning module to obtain the corresponding difference features and relation features; then, the appearance features, the difference features and the relation features are adaptively weighted and fused by the weighted fusion module to obtain the fused features; finally, a pooling operation is performed on the fused features to obtain the video-level global representation, which is input into the classifier to obtain the final classification result.
In the local residual error aggregation network module, the appearance characteristics of each individual are input into the local residual error aggregation module, residual errors between every two individuals and residual error correlation coefficients corresponding to the residual errors are respectively calculated, and finally the difference characteristics of each individual are calculated under the constraint of spatial positions.
For the group activity recognition task, we evaluated the proposed scheme on two benchmark datasets, the Volleyball dataset and the Collective Activity dataset, and compared it with state-of-the-art methods. Two metrics were used to evaluate model accuracy: MCA (multi-class classification accuracy) and MPCA (mean per-class accuracy).
For the Volleyball dataset, taking ARG as the baseline, the proposed two-branch reasoning scheme formed by the local residual aggregation network module and the graph-neural-network-based relation reasoning module, together with the proposed weighted fusion module, improves MCA by 2.6%. Compared with DIN, an advanced graph-neural-network-based method, the proposed method improves MCA and MPCA by 0.9% and 1.2% respectively with VGG16 as the backbone.
For the Collective Activity dataset, with VGG16 as the backbone and ARG as the baseline, the proposed method improves MCA by 5.1%. Compared with DIN, an advanced graph-neural-network-based method, the proposed method improves MPCA by 0.8% and 0.6% with ResNet18 and VGG16 as backbones, respectively.
In this method, we extract 3 representative frames from each video clip as input; each frame of the Volleyball dataset is cropped to 720 × 1280, and each frame of the Collective Activity dataset is cropped to 480 × 720. ResNet18 or VGG16 is adopted as the backbone network. An adaptive moment estimation (Adam) optimizer is used to train the model, with β1 = 0.9, β2 = 0.999, ε = 10^-8. For the Volleyball dataset, the initial learning rate is set to 1e-4 and is decayed by a factor of 0.3 every 10 epochs, with 40 training epochs in total. For the Collective Activity dataset, the learning rates for ResNet18 and VGG16 are set to 4e-5 and 1e-4 respectively, with 30 training epochs. The spatial limiting factor of the local residual aggregation network module is set to 0.2 and 0.3 of the image width for the Volleyball dataset and the Collective Activity dataset, respectively. The normalization factor d_k of the graph-neural-network-based relation reasoning module is set to 256. The dimension reduction factor of the weighted fusion module is set to 16. The batch size for both datasets is set to 2 during training.
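As an illustration of the optimizer settings quoted above for the Volleyball dataset, one possible PyTorch configuration is sketched below; the StepLR scheduler and the stand-in model are assumptions used only to show the quoted hyper-parameter values.

```python
import torch
import torch.nn as nn

model = nn.Linear(1024, 8)   # stand-in for the full recognition network (placeholder)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4,      # initial learning rate 1e-4
                             betas=(0.9, 0.999), eps=1e-8)     # beta1, beta2, epsilon as quoted
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.3)  # decay 0.3 every 10 epochs

for epoch in range(40):       # 40 training epochs for the Volleyball dataset
    # ... one epoch of training with batch size 2 would run here ...
    scheduler.step()
```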
The accuracy on these two benchmark datasets reflects the advancement of our method. Upon analysis, the method has two main advantages: (1) the proposed local residual aggregation network module can encode the potential differences among all the relevant actors in the group and provide additional cues for reasoning; (2) the proposed weighted fusion strategy can adaptively select the more important information among different semantic features.
As described above, the present invention can be preferably realized.
All features disclosed in all embodiments in this specification, or all methods or process steps implicitly disclosed, may be combined and/or expanded, or substituted, in any way, except for mutually exclusive features and/or steps.
The foregoing is only a preferred embodiment of the present invention, and the present invention is not limited thereto in any way, and any simple modification, equivalent replacement and improvement made to the above embodiment within the spirit and principle of the present invention still fall within the protection scope of the present invention.

Claims (10)

1. A group activity identification method based on a residual aggregation graph network is characterized by comprising the following steps:
S1, extracting appearance features: obtaining the individual-level appearance features X of the group to be identified by utilizing the key frames of a given video clip and the corresponding bounding box of each individual, X = {x_i | i = 1, 2, ..., T×N}, x_i ∈ R^D, wherein T represents the number of key frames of the video clip, N represents the number of individuals in each key frame, i represents the number of an individual in the video clip key frames, x_i represents the appearance feature of the individual numbered i in the video clip, R represents a linear space over the real number field, and D represents the dimension of the appearance features in the linear space R;
S2, two-branch reasoning: performing difference relation reasoning based on residual aggregation on the appearance features to obtain the difference features X̂, and performing similarity relation reasoning based on a graph neural network on the appearance features to obtain the relation features X';
S3, weighted fusion: performing weighted fusion of the appearance features, the difference features and the relation features in the channel direction to obtain the weighted features X̃;
S4, group activity prediction: performing a pooling operation on the weighted features to obtain a global representation of the whole video clip, further processing it to obtain the confidence of each frame for each group activity category, and predicting the group activity category using the average of the per-category confidences over all frames.
2. The group activity identification method based on a residual aggregation graph network according to claim 1, wherein in step S2, the formula for performing difference relation reasoning based on residual aggregation on the appearance features to obtain the difference features is as follows:
x̂_j = Σ_{i=1}^{T×N} r_i(x_j) · m_{j,i} · (x_j − x_i),
where j denotes the number of an individual in the video clip, j = 1, 2, ..., T×N; x̂_j represents the difference feature of the j-th individual and is an element of the difference features X̂; r_i(x_j) represents the residual relationship between individual j and individual i; m_{j,i} represents the spatial position correlation between individual j and individual i; and x_j − x_i represents the difference in appearance features between individual j and individual i (possibly located in different frames).
3. The group activity identification method based on a residual aggregation graph network according to claim 2, wherein the calculation formula of r_i(x_j) is:
r_i(x_j) = σ(w_j^T (x_j − x_i) + b_j),
where σ(·) denotes the sigmoid function, w_j represents the weight that maps the appearance difference between two individuals centered on individual j to a scalar, b_j represents the corresponding scalar offset, w_j ∈ R^D, b_j ∈ R^1.
4. The method according to claim 3, wherein the calculation formula of the spatial position correlation m_{j,i} is:
m_{j,i} = Π(d_{j,i} ≤ μ),
where Π(·) represents the indicator function, d_{j,i} denotes the Euclidean distance between individual j and individual i, and μ denotes the spatial limiting factor.
5. The group activity identification method based on a residual aggregation graph network according to claim 1, wherein in step S2, the formula for performing similarity relation reasoning based on a graph neural network on the appearance features to obtain the relation features is:
X' = Σ_{g=1}^{N_g} ReLU(G_g X W_g),
where g represents the number of a relation graph, N_g represents the number of graphs, ReLU(·) represents the ReLU activation function, G_g represents the relation graph numbered g, and W_g represents the weight matrix of the linear transformation corresponding to the relation graph numbered g.
6. The group activity identification method based on a residual aggregation graph network according to claim 5, wherein the calculation formula of G_g is:
G_g = {G_{i,j} ∈ R^1 | i, j = 1, ..., T×N},
G_{i,j} = Π(d_{i,j} ≤ μ) · exp(f_a(x_i, x_j)) / Σ_{j=1}^{T×N} ( Π(d_{i,j} ≤ μ) · exp(f_a(x_i, x_j)) ),
where G_{i,j} represents the magnitude of the similarity relationship between individual i and individual j, and f_a(x_i, x_j) represents the appearance correlation between individual i and individual j.
7. The method according to claim 6, wherein the calculation formula of f_a(x_i, x_j) is:
f_a(x_i, x_j) = θ(x_i) · Transpose(φ(x_j)) / √d_k,
where θ(x_i) represents the linear transformation that embeds the D-dimensional appearance feature x_i of individual i into a d_k-dimensional space, φ(x_j) represents the linear transformation that embeds the D-dimensional appearance feature x_j of individual j into a d_k-dimensional space, Transpose(·) represents the transpose operation, and d_k denotes the normalization factor.
8. The group activity identification method based on a residual aggregation graph network according to claim 7, wherein step S3 comprises the following steps:
S31, adding the appearance features, the difference features and the relation features element-wise to obtain the integrated features, calculated as:
F = Σ_{b=1}^{N_b} X_b,
where F represents the integrated features, F ∈ R^{T×N×D}; b is the branch number; N_b represents the number of branches, which in the above formula is 3; X_b represents the appearance features, the difference features or the relation features, i.e. X_1, X_2 and X_3 denote the different semantic features X, X̂ and X', respectively;
S32, embedding global information in the channel direction by using global average pooling and a fully connected layer to generate the channel statistics, calculated as:
S = W_s · ( (1/(T×N)) Σ_{t=1}^{T} Σ_{n=1}^{N} F(n, t, :) ),
where S represents the channel statistics, W_s represents the learnable parameters that linearly transform the pooled features, and F(n, t, :) represents the feature of the n-th individual in the t-th frame of F along the channel direction;
S33, obtaining the weights of the different branch features in the channel direction through a fully connected layer and a softmax operation, calculated as:
W_b = softmax(w_b S),
where W_b is the weight vector of the features of branch b, and w_b is the learnable linear transformation parameter of branch b that maps the channel statistics S to the weight vector;
S34, calculating each one-dimensional feature in the channel direction of the weighted fused features as:
X̃^c = Σ_{b=1}^{N_b} W_b^c · X_b^c,
where X̃^c represents the value of the c-th channel of the weighted features X̃; W_b^c represents the weight of the b-th branch feature in the c-th channel dimension, i.e. the value of the c-th element of W_b; and X_b^c represents the value of the b-th branch feature in the c-th channel dimension, i.e. the value of the c-th element of X_b.
9. The method according to claim 8, wherein in step S4, the global representation is passed through a fully connected layer and then a softmax operation, so as to obtain the confidence of each frame for each group activity category.
10. A system applied to the group activity identification method based on a residual aggregation graph network, characterized by comprising an appearance feature extraction module, a two-branch reasoning module, a weighted fusion module and a group activity prediction module which are electrically connected in sequence, wherein the appearance feature extraction module is also electrically connected with the weighted fusion module;
wherein:
appearance feature extraction module: used for obtaining the individual-level appearance features X of the group to be identified by utilizing the key frames of a given video clip and the bounding box of each individual, X = {x_i | i = 1, 2, ..., T×N}, x_i ∈ R^D, wherein T represents the number of key frames of the video clip, N represents the number of individuals in each key frame, i represents the number of an individual in the video clip key frames, x_i represents the appearance feature of the individual numbered i in the video clip, R represents a linear space over the real number field, and D represents the dimension of the appearance features in the linear space R;
two-branch reasoning module: used for performing difference relation reasoning based on residual aggregation on the appearance features to obtain the difference features X̂, and performing similarity relation reasoning based on a graph neural network on the appearance features to obtain the relation features X';
weighted fusion module: used for performing weighted fusion of the appearance features, the difference features and the relation features in the channel direction to obtain the weighted features X̃;
group activity prediction module: used for performing a pooling operation on the weighted features to obtain a global representation of the whole video clip, further processing it to obtain the confidence of each frame for each group activity category, and predicting the group activity category using the average of the per-category confidences over all frames.
CN202210236706.2A 2022-03-10 2022-03-10 Group activity identification method and system based on residual aggregation graph network Active CN114863356B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210236706.2A CN114863356B (en) 2022-03-10 2022-03-10 Group activity identification method and system based on residual aggregation graph network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210236706.2A CN114863356B (en) 2022-03-10 2022-03-10 Group activity identification method and system based on residual aggregation graph network

Publications (2)

Publication Number Publication Date
CN114863356A true CN114863356A (en) 2022-08-05
CN114863356B CN114863356B (en) 2023-02-03

Family

ID=82627853

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210236706.2A Active CN114863356B (en) 2022-03-10 2022-03-10 Group activity identification method and system based on residual aggregation graph network

Country Status (1)

Country Link
CN (1) CN114863356B (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111401174A (en) * 2020-03-07 2020-07-10 北京工业大学 Volleyball group behavior identification method based on multi-mode information fusion
WO2021032295A1 (en) * 2019-08-21 2021-02-25 Toyota Motor Europe System and method for detecting person activity in video
CN112434608A (en) * 2020-11-24 2021-03-02 山东大学 Human behavior identification method and system based on double-current combined network
CN112613349A (en) * 2020-12-04 2021-04-06 北京理工大学 Time sequence action detection method and device based on deep hybrid convolutional neural network
CN112699786A (en) * 2020-12-29 2021-04-23 华南理工大学 Video behavior identification method and system based on space enhancement module
CN112818843A (en) * 2021-01-29 2021-05-18 山东大学 Video behavior identification method and system based on channel attention guide time modeling
CN112926396A (en) * 2021-01-28 2021-06-08 杭州电子科技大学 Action identification method based on double-current convolution attention
US20210248375A1 (en) * 2020-02-06 2021-08-12 Mitsubishi Electric Research Laboratories, Inc. Scene-Aware Video Dialog
CN113435430A (en) * 2021-08-27 2021-09-24 中国科学院自动化研究所 Video behavior identification method, system and equipment based on self-adaptive space-time entanglement
CN113673489A (en) * 2021-10-21 2021-11-19 之江实验室 Video group behavior identification method based on cascade Transformer

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021032295A1 (en) * 2019-08-21 2021-02-25 Toyota Motor Europe System and method for detecting person activity in video
US20210248375A1 (en) * 2020-02-06 2021-08-12 Mitsubishi Electric Research Laboratories, Inc. Scene-Aware Video Dialog
CN111401174A (en) * 2020-03-07 2020-07-10 北京工业大学 Volleyball group behavior identification method based on multi-mode information fusion
CN112434608A (en) * 2020-11-24 2021-03-02 山东大学 Human behavior identification method and system based on double-current combined network
CN112613349A (en) * 2020-12-04 2021-04-06 北京理工大学 Time sequence action detection method and device based on deep hybrid convolutional neural network
CN112699786A (en) * 2020-12-29 2021-04-23 华南理工大学 Video behavior identification method and system based on space enhancement module
CN112926396A (en) * 2021-01-28 2021-06-08 杭州电子科技大学 Action identification method based on double-current convolution attention
CN112818843A (en) * 2021-01-29 2021-05-18 山东大学 Video behavior identification method and system based on channel attention guide time modeling
CN113435430A (en) * 2021-08-27 2021-09-24 中国科学院自动化研究所 Video behavior identification method, system and equipment based on self-adaptive space-time entanglement
CN113673489A (en) * 2021-10-21 2021-11-19 之江实验室 Video group behavior identification method based on cascade Transformer

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
JINHUI TANG et al.: "Coherence Constrained Graph LSTM", IEEE
Zhang Lejun (张乐军): "Research on Group Behavior Recognition Algorithms Based on Surveillance Video", China Master's Theses Full-text Database, Information Science and Technology
Xiong Xin (熊辛): "Research on Behavior Recognition Based on Behavior Feature Optimization and Deep Learning", China Doctoral Dissertations Full-text Database, Information Science and Technology

Also Published As

Publication number Publication date
CN114863356B (en) 2023-02-03

Similar Documents

Publication Publication Date Title
CN110084151B (en) Video abnormal behavior discrimination method based on non-local network deep learning
CN111259786B (en) Pedestrian re-identification method based on synchronous enhancement of appearance and motion information of video
CN112507901B (en) Unsupervised pedestrian re-identification method based on pseudo tag self-correction
CN109299657B (en) Group behavior identification method and device based on semantic attention retention mechanism
CN110827265B (en) Image anomaly detection method based on deep learning
CN113157957A (en) Attribute graph document clustering method based on graph convolution neural network
Chi et al. A decision support system for detecting serial crimes
CN112861970B (en) Fine-grained image classification method based on feature fusion
CN115761900B (en) Internet of things cloud platform for practical training base management
CN112507778B (en) Loop detection method of improved bag-of-words model based on line characteristics
CN113297936A (en) Volleyball group behavior identification method based on local graph convolution network
CN111598032B (en) Group behavior recognition method based on graph neural network
CN114005085A (en) Dense crowd distribution detection and counting method in video
CN115273244A (en) Human body action recognition method and system based on graph neural network
Shi et al. Dynamic barycenter averaging kernel in RBF networks for time series classification
CN112163020A (en) Multi-dimensional time series anomaly detection method and system
CN110111365B (en) Training method and device based on deep learning and target tracking method and device
CN114863356B (en) Group activity identification method and system based on residual aggregation graph network
CN116704609A (en) Online hand hygiene assessment method and system based on time sequence attention
CN116257786A (en) Asynchronous time sequence classification method based on multi-element time sequence diagram structure
CN107665325A (en) Video accident detection method and system based on atomic features bag model
CN117011219A (en) Method, apparatus, device, storage medium and program product for detecting quality of article
CN116012903A (en) Automatic labeling method and system for facial expressions
CN115293249A (en) Power system typical scene probability prediction method based on dynamic time sequence prediction
CN114386494A (en) Product full life cycle quality tracing method and device based on extensible ontology

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant