CN114863356B - Group activity identification method and system based on residual aggregation graph network - Google Patents
Group activity identification method and system based on residual aggregation graph network Download PDFInfo
- Publication number
- CN114863356B CN114863356B CN202210236706.2A CN202210236706A CN114863356B CN 114863356 B CN114863356 B CN 114863356B CN 202210236706 A CN202210236706 A CN 202210236706A CN 114863356 B CN114863356 B CN 114863356B
- Authority
- CN
- China
- Prior art keywords
- individual
- representing
- appearance
- difference
- branch
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
- 230000000694 effects Effects 0.000 title claims abstract description 63
- 238000000034 method Methods 0.000 title claims abstract description 58
- 230000002776 aggregation Effects 0.000 title claims abstract description 39
- 238000004220 aggregation Methods 0.000 title claims abstract description 39
- 230000004927 fusion Effects 0.000 claims description 30
- 238000004364 calculation method Methods 0.000 claims description 24
- 238000013528 artificial neural network Methods 0.000 claims description 21
- 230000009466 transformation Effects 0.000 claims description 14
- 238000011176 pooling Methods 0.000 claims description 11
- 238000000605 extraction Methods 0.000 claims description 8
- 239000013598 vector Substances 0.000 claims description 8
- 238000010586 diagram Methods 0.000 claims description 7
- 238000012545 processing Methods 0.000 claims description 5
- 230000004913 activation Effects 0.000 claims description 3
- 239000011159 matrix material Substances 0.000 claims description 3
- 238000010606 normalization Methods 0.000 claims description 3
- 229910052731 fluorine Inorganic materials 0.000 claims description 2
- 125000001153 fluoro group Chemical group F* 0.000 claims description 2
- 229910052739 hydrogen Inorganic materials 0.000 claims description 2
- 239000001257 hydrogen Substances 0.000 claims description 2
- 125000004435 hydrogen atom Chemical class [H]* 0.000 claims description 2
- 239000004973 liquid crystal related substance Substances 0.000 claims 1
- 230000009471 action Effects 0.000 abstract description 12
- 238000012216 screening Methods 0.000 abstract description 5
- 238000010191 image analysis Methods 0.000 abstract description 2
- 238000012549 training Methods 0.000 description 15
- 230000006870 function Effects 0.000 description 7
- 238000001514 detection method Methods 0.000 description 6
- 239000010410 layer Substances 0.000 description 6
- 238000012544 monitoring process Methods 0.000 description 4
- 230000003044 adaptive effect Effects 0.000 description 2
- 238000004458 analytical method Methods 0.000 description 2
- 230000006399 behavior Effects 0.000 description 2
- 238000012795 verification Methods 0.000 description 2
- ORILYTVJVMAKLC-UHFFFAOYSA-N Adamantane Natural products C1C(C2)CC3CC1CC2C3 ORILYTVJVMAKLC-UHFFFAOYSA-N 0.000 description 1
- 230000002159 abnormal effect Effects 0.000 description 1
- 238000013459 approach Methods 0.000 description 1
- 230000009286 beneficial effect Effects 0.000 description 1
- 238000010276 construction Methods 0.000 description 1
- 230000007547 defect Effects 0.000 description 1
- 238000005516 engineering process Methods 0.000 description 1
- 238000002474 experimental method Methods 0.000 description 1
- 238000009499 grossing Methods 0.000 description 1
- 230000006872 improvement Effects 0.000 description 1
- 238000013507 mapping Methods 0.000 description 1
- 230000000873 masking effect Effects 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 238000005457 optimization Methods 0.000 description 1
- 238000011160 research Methods 0.000 description 1
- 239000002356 single layer Substances 0.000 description 1
- 238000012360 testing method Methods 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/047—Probabilistic or stochastic networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Computing Systems (AREA)
- Software Systems (AREA)
- Molecular Biology (AREA)
- Computational Linguistics (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Mathematical Physics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Probability & Statistics with Applications (AREA)
- Image Analysis (AREA)
Abstract
The invention relates to the technical field of data identification, in particular to the technical field of intelligent video image analysis, and discloses a group activity identification method and a group activity identification system based on a residual aggregation graph network, wherein the method comprises the following steps: s1, extracting appearance characteristics; s2, double-branch reasoning; s3, weighting and fusing; and S4, predicting group activities. The invention solves the problems that video clips with similar individual actions but different group activity categories are difficult to effectively distinguish, importance screening on different semantic features is lacked and the like in the prior art.
Description
Technical Field
The invention relates to the technical field of data identification, in particular to the technical field of intelligent video image analysis, and specifically relates to a group activity identification method and system based on a residual aggregation graph network.
Background
With the growing urban population and the dramatic increase in the flow of people in public areas, people monitoring and management face great challenges and pressures. If the video monitoring technology can be used for monitoring and alarming the abnormal group behavior phenomenon in the important area in time, relevant departments can take corresponding measures aiming at the early warning or alarming phenomenon in the shortest time to minimize the occurrence possibility of safety accidents and minimize the loss caused by the accidents. Therefore, more and more video monitoring systems are applied to public places to maintain public order and improve public area security, and group activity analysis is also receiving more and more attention.
The difficulty in group activity detection is that in addition to the need to master the actions of individuals, the potential connections between individuals in a group need to be mastered. Therefore, for better recognition of group activities, it is critical to utilize various information, such as appearance information, spatial location information, similarity relationship information, and difference information, among others.
For group activity recognition at present, the problem is mostly solved by a method based on a graph neural network. The method mainly comprises the following steps:
(1) and extracting appearance characteristics of each individual in a plurality of representative frames of the corresponding video clip through the basic network.
(2) And capturing the correlation among individuals in the group in a graph mode, and extracting the relation characteristics by graph convolution.
(3) And performing simple addition fusion and pooling operation on the individual appearance signs and the relationship characteristics to obtain video characteristics representing the whole video clip.
(4) And sending the video characteristics to a classifier to obtain a corresponding group activity classification result.
Such a method based on a graph neural network firstly ignores difference information (such as a small difference between close actions) existing between active persons in a video, which is very important for effectively distinguishing video segments with similar individual actions but different group activity categories; secondly, equal-weight fusion is adopted for the fusion of the low-order appearance characteristic and the high-order reasoning characteristic, and the fusion mode lacks of importance screening of different semantic characteristics.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a group activity identification method and system based on a residual aggregation graph network, and solves the problems that video segments with similar individual actions but different group activity types are difficult to effectively distinguish, importance screening on different semantic features is lacked, and the like in the prior art.
The technical scheme adopted by the invention for solving the problems is as follows:
a group activity identification method based on a residual aggregation graph network comprises the following steps:
s1, appearance feature extraction: obtaining the appearance characteristics X of the individual level of the group to be identified by utilizing the key frames of the given video clip and the corresponding bounding boxes of each individual,x i ∈R D wherein T represents the frame number of the key frame of the video clip, and N represents the individual number of each key frame in the video clip; i denotes the number of individuals in the video clip key-frame, i =1,2, \8230;, T × N; x is a radical of a fluorine atom i Representing the appearance characteristics of an individual with the number i in a video clip, wherein R represents a linear space in a real number domain, and D represents the dimension of the appearance characteristics of the linear space R;
s2, double-branch reasoning: performing difference relation reasoning on the appearance characteristics based on residual error aggregation to obtain difference characteristicsAnd respectively carrying out similarity relation reasoning on the appearance characteristics based on a graph neural network to obtain relation characteristics X';
s3, weighted fusion: weighting and fusing the appearance characteristics, the difference characteristics and the relation characteristics in the channel direction to obtainWeighted features
S4, predicting group activities: and performing pooling operation on the weighted features to obtain a global representation representing the whole video segment, further processing to obtain the confidence of each frame to the group activity category, and predicting the group activity category by using the average value of the confidence of each category of each frame.
As a preferred technical solution, in step S2, performing difference relation inference based on residual aggregation on the appearance features to obtain a formula of difference features, where the formula is:
wherein j represents the number of an individual in a video clip, j =1,2, \8230;, T × N;representing the difference characteristic of the jth individual,is a difference featureIs a function of one of the elements of (a), r i (x j ) Representing the residual relationship between individual j and individual i,representing the spatial position correlation, x, between an individual j and an individual i j -x i Representing the difference in appearance characteristics between individual j and individual i in different time slots.
As a preferred embodiment, r i (x j ) The calculation formula of (2) is as follows: :
wherein, w j Representing a weight, b, mapping the apparent difference between two volumes centered on an individual j to a scalar j Offset value w representing a scalar quantity that maps appearance differences between two volumes centered on an individual j j ∈R D ,b j ∈R 1 。
wherein Π (·) represents an indicator function,denotes the euclidean distance between the individual j and the individual i, and μ denotes a spatial restriction factor.
As a preferred technical solution, in step S2, the appearance features are respectively subjected to similarity relationship inference based on a graph neural network to obtain a formula of relationship features, where the formula is as follows:
wherein g represents the number of the relational diagram, N g Representing the number of graphs, reLU () representing the ReLU activation function, G g Shows a relational graph with the number g, W g A weight matrix which is linearly transformed corresponding to the relational graph with the number g is shown.
As a preferred embodiment, G g The calculation formula of (2) is as follows:
G g ={G i,j ∈R 1 |i,j=1,...,T×N},
wherein G is i,j Representing the magnitude of the similarity relationship between the individual i and the individual j, f a (x i ,x j ) Representing the apparent correlation between individual i and individual j.
As a preferred solution, f a (x i ,x j ) The calculation formula of (2) is as follows:
wherein, theta (x) i ) Appearance feature x representing the appearance of an individual i i By D-dimension embeddingA linear transformation of the dimensional space is performed,appearance feature x representing the individual j j By D-dimension embeddingLinear transformation of dimensional space, tanspose () representing a transpose operation, d k Indicating a normalization factor.
As a preferred technical solution, the step S3 includes the steps of,
s31, adding the elements of the appearance characteristic, the difference characteristic and the relation characteristic to obtain an integrated characteristic, wherein the calculation mode is as follows:
wherein F represents an integerCombined characteristics, F ∈ R T×N×D B is the branch number; n is a radical of b Representing the number of branches, N in the above formula b Is 3,X i Representing appearance features, difference features or relationship features,respectively represent different semantic features X,And X';
s32, embedding global information in the channel direction by using a global average pooling layer and a full connection layer to generate channel statistical information, wherein the calculation mode is as follows:
wherein S represents channel statistics, W s Representing learnable parameters for linear transformation of pooled features,f (n, t:) represents the characteristics of the nth individual of the t frame in F in the channel direction;
s33, obtaining the weights of different branch characteristics in the channel direction through the operation of the full connection layer and the softmax, wherein the calculation mode is as follows:
W b =softmax(w b S);
wherein, W b Weight vector of features branched into b, w b For the learnable linear transformation parameters of branch b that map the channel statistics S to weight vectors,
s34, calculating each one-dimensional feature in the channel direction of the weighted and fused features in the following way:
wherein,representing weighted featuresThe value of the c-th channel of (c),representing the weight of the feature in the c-th channel dimension of the b-th branch,is W b The value of the c-th element;a value representing a characteristic in the channel dimension of the c-th branch of the b-th branch,is X b The value of the c-th element.
As a preferred technical solution, in step S4, the global representation is passed through a full connection layer and then through softmax operation, so as to obtain the confidence of each frame with respect to the group activity category.
The system applied to the group activity recognition method based on the residual aggregation map network comprises an appearance characteristic extraction module, a two-branch reasoning module, a weighted fusion module and a group activity prediction module which are electrically connected in sequence, wherein the appearance characteristic extraction module is also electrically connected with the weighted fusion module;
wherein;
appearance characteristic extraction module: the method is used for obtaining appearance characteristics X of the individual level of the group to be identified by utilizing key frames of the given video clip and the bounding box of each individual,x i ∈R D wherein T represents the frame number of the key frame of the video clip, and N represents the individual number of each key frame in the video clip; i denotes the number of individuals in the video clip key-frame, i =1,2, \8230;, T × N; x is the number of i Representing the appearance characteristics of an individual with the number i in a video clip, wherein R represents a linear space in a real number domain, and D represents the dimension of the appearance characteristics of the linear space R;
a two-branch reasoning module: difference relation reasoning for performing residual error aggregation based on appearance characteristics to obtain difference characteristicsAnd performing similarity relationship reasoning on the appearance characteristics based on a graph neural network to obtain relationship characteristics X';
a weighted fusion module: the method is used for performing weighted fusion on the appearance characteristics, the difference characteristics and the relation characteristics in the channel direction to obtain weighted characteristics
A group activity prediction module: the method comprises the steps of performing a pooling operation on weighted features to obtain a global representation representing the whole video segment, then further processing to obtain the confidence of each frame to the group activity category, and predicting the group activity category by using the average value of the confidence of each category of each frame.
Compared with the prior art, the invention has the following beneficial effects:
(1) The method utilizes the difference information among the actors in the video, which is very important for effectively distinguishing the video segments which have similar individual actions but different group activity categories; the method for capturing the potential useful difference information in the group and the method for fusing different semantic features by self-adaptive weighting greatly improve the accuracy of group activity detection;
(2) The local residual error aggregation network module provided by the invention can encode potential differences among all related actors in a crowd and provide additional clues for reasoning;
(3) The weighted fusion strategy provided by the invention can adaptively select more important information in different semantic features.
Drawings
FIG. 1 is a schematic diagram illustrating steps of a group activity recognition method based on a residual aggregation graph network according to the present invention;
FIG. 2 is a block diagram of a group activity recognition system based on a residual aggregation graph network according to the present invention;
FIG. 3 is a schematic view of a workflow of a group activity recognition system based on a residual aggregation graph network according to the present invention;
FIG. 4 is a schematic diagram of a local residual aggregation network module according to the present invention;
FIG. 5 is a schematic structural diagram of a weighted fusion module according to the present invention;
FIG. 6 is a flow chart of model training according to the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited to these examples.
Example 1
As shown in fig. 1 to 6, the present invention provides a group activity recognition method and a model training method, which can detect an action category of a group from a video.
The appearance information and the space-time information of the individuals are required to be extracted for detecting the actions of the individuals in the video, and each individual has own action behavior in a real scene, so that the identification of group activities is required to extract the appearance information and the space-time information of each individual and model the potential information among the individuals so as to deduce the potential relationship among the groups relative to the detection of the actions of the individuals.
Most population-based methods based on graph neural networks first ignore the difference information that exists between the active people in the video (such as the slight difference between close actions); secondly, equal-weight fusion is adopted for the fusion of the low-order appearance characteristic and the high-order relation characteristic, and the fusion mode lacks the importance screening of different semantic characteristics. The method for searching the mode capable of capturing the potential useful difference information in the group and the method for adaptively weighting and fusing different semantic features are very important for improving the accuracy of group activity detection.
In order to solve the above problems, the present invention provides a group activity identification method and system based on a residual aggregation graph network. The local residual error aggregation network module is used for capturing potential useful difference information in a population, and can be combined with the relation inference network module based on the graph neural network to form double-branch inference. The weighting fusion module is used for adaptively weighting and fusing different semantic features to achieve the effect of screening important information.
The main technical solution of the present invention will be described in detail with reference to specific examples.
1. The group activity detection method based on the graph neural network and the residual error aggregation network comprises the following steps:
1. extracting basic appearance characteristics:
the key frames of the given video clip and the corresponding bounding boxes of each individual are subjected to individual-level appearance characteristics by using a backbone network and Roialign, and the appearance characteristics are expressed asWherein x is i Representing the appearance characteristics, x, of each individual i ∈R D 。
2. The two-branch reasoning network module:
and inputting the obtained appearance characteristic X into a local residual error aggregation network module and a relation inference network module based on a graph neural network respectively to carry out double-branch high-order inference. These two networks will be described separately below.
(1) Local residual aggregation network module (LR) 2 M):
Inputting the obtained appearance characteristic into LR 2 M, modeling the difference information between individuals in the group to obtain the difference characteristics(whereinIs a difference characteristicIs a function of one of the elements of (1),representing the difference signature of the jth individual).The calculation method of (c) is as follows:
wherein x is j -x i Representing the difference in appearance between the jth individual and the ith individual; r is a radical of hydrogen i (x j ) Representing the residual relationship between an individual j and an individual i, when there is a useful difference relationship between (j, i), r i (x j ) =1, when there is no useful difference relation r between (j, i) i (x j )=0;Is a space limitation.
To make the above equation differentiable so as to be able to propagate backwards during network training, we will let r i (x j ) Performing smoothing on r i (x j ) The calculation is shown below:
wherein, w j Representing a weight, b, that maps the apparent difference between two volumes centered on an individual j as a scalar j Offset value, w, representing a scalar quantity that maps appearance differences between two volumes centered on an individual j j ∈R D ,b j ∈R 1 。
Previous research experiments have demonstrated that local information is more conducive to group activity category reasoning. Therefore, the local residual relation is aggregated by using a distance mask, and the calculation method is as follows:
wherein pi (·) is an indication function, μ is a space limiting factor, and is a hyper-parameter, and the specific value is determined according to the situation;is the euclidean distance between the individual j and the individual i.
group activities are inferred by modeling local differences between the active people in the group to obtain useful difference information in the video clip.
(2) The relation reasoning module based on the neural network of the graph:
for relational modeling, a graph neural network is adopted to establish an actor-relational graph, and relational information between actors can be provided for group activity recognition by utilizing a graph structure.
Each node in the actor-relationship graph represents each individual, while the importance of the relationship between two individuals is represented by the edge weights between the two actors. The weight of each edge in the graph is determined by the appearance characteristics and the spatial position of the individuals at the two ends. The method for calculating the weight of the edge between two body nodes in the method comprises the following steps:
wherein, G i,j Representing the magnitude of the similarity relationship between the individual i and the individual j, f a (x i ,x j ) Representing the apparent correlation between individual i and individual j.
For the appearance characteristic correlation, an embedded dot product method is adopted, and the corresponding formula is as follows:
wherein, theta (x) i ) Appearance feature x representing the appearance of an individual i i By D-dimension embeddingA linear transformation of the dimensional space is performed,appearance feature x representing the individual j j By D-dimension embeddingLinear transformation of the dimensional space, tanspose () representing a transpose operation, d k Represents a normalization factor, d k Is a constant.
For spatial position correlation, the same distance masking approach as the above local residual aggregation network is adopted:
wherein Π (·) represents an indicator function,denotes the euclidean distance between individual j and individual i and μ denotes the spatial limiting factor.
Thus, for a graph of relationships between two actors in a population, it can be expressed as:
G g ={G i,j ∈R 1 |i,j=1,...,T×W};
and meanwhile, a plurality of relation graphs are constructed to capture different related information. In the present invention, a series of graphs are builtEach graph is computed separately and does not share weights. The multiple graphs can be established, and the model can be operated to merge and learn different types of relationship information, so that the model can make more reliable relationship reasoning.
After graph construction, single-layer relationship reasoning is implemented using GCN. For the relationship features, they are calculated as follows:
wherein g represents the number of the relational diagram, N g Representing the number of graphs, reLU () representing the ReLU activation function, G g Shows a relational graph with the number g, W g A weight matrix which is linearly transformed corresponding to the relational graph with the number g is shown.
Difference characteristics are respectively obtained through the local residual error aggregation network module and the relation reasoning module based on the graph neural networkAnd a relation feature X' to form a two-branch reasoning network.
3. Weighted fusion module (WAS):
the appearance characteristics X and the differenceFeature(s)And inputting the relation characteristic X' into the self-adaptive weighting fusion module, and performing weighting fusion on the three different semantic characteristics in the channel direction. The specific method comprises the following steps:
(1) firstly, integrating information of all branch characteristics, and obtaining the integrated characteristics through simple addition in element aspects, wherein the calculation mode is as follows:
wherein F represents the integrated characteristic, and F is epsilon of R T×N×D B is the branch number; n is a radical of b Representing the number of branches, N in the above formula b Is 3,X i Representing appearance features, difference features or relationship features,respectively represent different semantic features X,And X'.
(2) The global information is simply embedded in the channel direction using the global average pooling and full-link layers to generate channel statistics, calculated as follows:
wherein S represents channel statistics, W s Representing learnable parameters for linear transformation of pooled features,f (n, t:) represents the characteristics of the nth individual of the t-th frame in F in the channel direction.
(3) The weights of the different branch features in the channel direction are obtained by simple full connectivity and softmax operations. The calculation method is as follows:
W b =softmax(w b S);
wherein, W b Weight vector of features branched into b, w b For the learnable linear transformation parameters of branch b that map the channel statistics S to weight vectors,
(4) finally, the calculation mode of each one-dimensional feature in the channel direction of the weighted and fused feature is as follows:
wherein,representing weighted featuresThe value of the c-th channel of (c),representing the weight of the feature in the c-th channel dimension of the b-th branch,is W b The value of the c-th element;a value representing a characteristic in the channel dimension of the c-th branch of the b-th branch,is X b The value of the c-th element.
Through the WAS, the features under different semantics are subjected to adaptive weighted fusion, and information more important for group activity identification is screened out.
4. By applying the above-obtained weighting characteristicsA pooling operation is performed resulting in a global representation representing the entire video segment. And (4) obtaining the confidence of each group activity category of each frame by a simple full connection layer and then by softmax operation through the global representation. And predicting the group activity category by using the average value of the confidence degrees of the categories of each frame.
2. The model training method for group detection based on the graph neural network and the residual error aggregation network comprises the following steps:
1. and acquiring a video clip sample and a label corresponding to the sample, wherein the label represents the group activity of each key frame in the training sample.
2. Dividing the sample and the label thereof into two parts according to a proportion, wherein one part is a training set and is used for training the model; and a part is a verification set used for selecting the model.
3. And pre-training the backbone network by using the processed training set.
4. And (3) processing the samples in the training set, outputting a prediction result through a model, and calculating the loss of the prediction result and the real label by using cross loss entropy.
5. And training the model through back propagation and parameter updating, and performing reasoning test by using a verification set so as to select a better model result.
Example 2
As shown in fig. 1 to fig. 6, as a further optimization of the embodiment 1, this embodiment includes all the technical features of the embodiment 1, and in addition, this embodiment also includes the following technical features:
firstly, inputting a key frame of a given video clip and a corresponding bounding box of each individual into a backbone network by utilizing the backbone network and Roialign to obtain the appearance characteristics of the individual; secondly, inputting the appearance characteristics into a local residual error aggregation network module and a relation reasoning module based on a graph neural network to obtain corresponding difference characteristics and relation characteristics; then, performing self-adaptive weighted fusion on the appearance characteristic, the difference characteristic and the relation characteristic through a weighted fusion module to obtain a fused characteristic; and finally, performing pooling operation on the fused features to obtain video global representation, and inputting the video global representation into a classifier to obtain a final classification result.
In the local residual error aggregation network module, the appearance characteristics of each individual are input into the local residual error aggregation module, residual errors between every two individuals and residual error correlation coefficients corresponding to the residual errors are respectively calculated, and finally the difference characteristics of each individual are calculated under the constraint of spatial positions.
In the task of group action recognition, we put the proposed solution on two reference datasets: the Volleyball dataset and the Collective Activity dataset were compared to the most advanced methods. And we used two indices to evaluate model accuracy, MCA (multi-class classification accuracy) and MPCA (average per class accuracy), respectively.
For the Volleyball data set, the MCA is improved by 2.6% on the Volleyball data set through a proposed two-branch inference mode formed by a local residual error network module and a graph neural network-based relation inference module and a proposed weighting fusion module on the basis of ARG. Compared with the advanced method DIN based on the neural network of the graph, the method provided by the invention respectively improves MCA and MPCA by 0.9% and 1.2% under the condition of taking VGG16 as the backbone.
For the Collective Activity dataset, our proposed method improved MCA by 5.1% with VGG16 as backbone and ARG as baseline. Compared with advanced methods DIN based on a graph neural network, the method respectively takes RestNet18 and VGG16 as backbones, and the method is respectively improved by 0.8 percent and 0.6 percent in MPCA
In this method we take 3 representative frames from each video segment as input and clip each frame of the Volleyball dataset to a size of 720 × 1280 and each frame of the Collective Activity dataset to a size of 480 × 720. The method adopts RestNet18 or VGG16 as a backbone network. Adaptive momentum estimation (Adam) optimizerIn a training model, wherein 1 =0.9,β 2 =0.999,∈=10 -8 . For the Volleyball dataset, the initial learning rate is set to 1e -4 The learning rate is updated every 10 epochs at a decay rate of 0.3, with a training period of 40. For the Collective Activity dataset, the RestNet18 or VGG16 learning rates are set to 4e for -5 And 1e -4 The training period is 30. The spatial limiting factor of the local residual aggregation network module is set to a graph width of 0.2 and 0.3 in the Volleyball data set and the Collective Activity data set, respectively. Canonical factor d of relational inference module based on graph neural network k Set to 256. The dimension shrinkage factor of the weighted fusion module is set to 16. The batch size of both datasets was set to 2 at training.
The accuracy on these two reference data sets reflects the advancement of our method. For analysis reasons, the method can have two main advantages: (1) The local residual error aggregation network module provided by the method can encode potential differences among all related actors in the crowd and provide additional clues for reasoning; (2) The weighting fusion strategy provided by the method can adaptively select more important information in different semantic characteristic values.
As described above, the present invention can be preferably realized.
All features disclosed in all embodiments in this specification, or all methods or process steps implicitly disclosed, may be combined and/or expanded, or substituted, in any way, except for mutually exclusive features and/or steps.
The foregoing is only a preferred embodiment of the present invention, and the present invention is not limited thereto in any way, and any simple modification, equivalent replacement and improvement made to the above embodiment within the spirit and principle of the present invention still fall within the protection scope of the present invention.
Claims (8)
1. A group activity identification method based on a residual aggregation graph network is characterized by comprising the following steps:
s1, appearance feature extraction: obtaining the appearance characteristics X of the individual level of the group to be identified by utilizing the key frames of the given video clip and the bounding boxes of each individual,x i ∈R D wherein T represents the frame number of the key frame of the video clip, and N represents the individual number of each key frame in the video clip; i represents the number of individuals in the key frame of the video segment, i =1,2, \8230;, T × N; x is a radical of a fluorine atom i Representing the appearance characteristics of individuals numbered i in the video clip, wherein R represents a linear space in a real number domain, and D represents the dimension of the appearance characteristics of the linear space R;
s2, double-branch reasoning: performing difference relation reasoning on the appearance characteristics based on residual error aggregation to obtain difference characteristicsAnd performing similarity relationship reasoning on the appearance characteristics based on a graph neural network to obtain relationship characteristics X';
s3, weighted fusion: weighting and fusing the appearance characteristics, the difference characteristics and the relation characteristics in the channel direction to obtain weighted characteristics
S4, predicting group activities: pooling the weighted features to obtain a global representation representing the whole video segment, further processing to obtain a confidence coefficient of each frame for the group activity category, and predicting the group activity category by using an average value of the confidence coefficients of each category of each frame;
in step S2, the appearance features are subjected to difference relationship inference based on residual aggregation to obtain a difference feature formula as follows:
wherein j represents the number of an individual in a video clip, j =1,2, \8230;, T × N;representing the difference characteristic of the jth individual,is a difference characteristicIs a function of one of the elements of (1), r i (x j ) Representing the residual relationship between individual j and individual i,representing the spatial position correlation, x, between an individual j and an individual i j -x i Representing the difference in appearance characteristics between an individual j and an individual i in different sky;
the step S3 comprises the following steps,
s31, adding the elements of the appearance characteristic, the difference characteristic and the relation characteristic to obtain an integrated characteristic, wherein the calculation mode is as follows:
wherein F represents the integrated characteristic, and F is epsilon of R T×N×D B is the branch number; n is a radical of b Representing the number of branches, N in the above formula b Is 3,X l Representing appearance features, difference features or relationship features,respectively represent different semantic features X,And X';
s32, embedding global information in the channel direction by using a global average pooling layer and a full connection layer to generate channel statistical information, wherein the calculation mode is as follows:
where s represents channel statistics, W s Representing learnable parameters for linear transformation of pooled features,f (n, t:) represents the characteristics of the nth individual in the t frame in the F in the channel direction;
s33, obtaining the weights of different branch characteristics in the channel direction through the operation of the full connection layer and the softmax, wherein the calculation mode is as follows:
W b =softmax(w b S);
wherein, W b Weight vector of features branched into b, w b For branch b the mathematically linear transformation parameters that map the channel statistics S to weight vectors,
s34, calculating each one-dimensional feature in the channel direction of the weighted and fused features in the following way:
wherein,representing weighted featuresThe value of the c-th channel of (c),representing the weight of the feature in the c channel dimension of the b-th branch,is W b The value of the c-th element;a value representing a characteristic in the channel dimension of the c-th branch of the b-th branch,is X b The value of the c-th element.
2. The method of claim 1, wherein r is the group activity recognition method based on the residual aggregation graph network i (x j ) The calculation formula of (2) is as follows:
wherein, w j Representing a weight, b, that maps the apparent difference between two volumes centered on an individual j as a scalar j Offset value, w, representing a scalar quantity that maps the apparent difference between two volumes centered on an individual j j ∈R D ,b j ∈R 1 。
3. The method according to claim 2, wherein the group activity recognition method based on the residual aggregation graph network,the calculation formula of (c) is:
4. The group activity recognition method based on the residual aggregation graph network as claimed in claim 1, wherein in step S2, the appearance features are respectively subjected to graph neural network-based similarity relationship inference to obtain a relationship feature formula as follows:
wherein g represents the number of the relational diagram, N g Representing the number of graphs, reLU () representing the ReLU activation function, G g Shows a relational graph with the number g, W g A weight matrix which is linearly transformed corresponding to the relational graph with the number g is shown.
5. The method of claim 4, wherein G is a group activity recognition method based on the residual aggregation graph network g The calculation formula of (2) is as follows:
G g ={G i,j ∈R 1 |i,j=1,...,T×N},
wherein G is i,j Representing the magnitude of the similarity relationship between the individual i and the individual j, f a (x i ,x j ) Representing the apparent correlation between individual i and individual j.
6. The method according to claim 5, wherein f is a group activity recognition method based on a residual aggregation graph network a (x i ,x j ) The calculation formula of (2) is as follows:
wherein, theta (x) i ) Appearance characteristic x representing the individual i i By D-dimension embeddingA linear transformation of the dimensional space is performed,appearance feature x representing the individual j j By D-dimension embeddingLinear transformation of dimensional space, tansp. se () represents a transpose operation, d k Indicating a normalization factor.
7. The method of claim 6, wherein in step S4, the global representation is passed through a full link layer and then through softmax operation, so as to obtain the confidence of each frame with respect to the group activity category.
8. The system applied to the group activity identification method based on the residual aggregation map network is characterized by comprising an appearance feature extraction module, a double-branch reasoning module, a weighted fusion module and a group activity prediction module which are electrically connected in sequence, wherein the appearance feature extraction module is also electrically connected with the weighted fusion module;
wherein;
appearance feature extraction module: the method is used for obtaining appearance characteristics X of the individual level of the group to be identified by utilizing key frames of the given video clip and the bounding box of each individual,x i ∈R D wherein, T represents the frame number of the key frame of the video clip, and N represents the number of each key frame in the video clip; i represents the number of individuals in the key frame of the video segment, i =1,2, \8230;, T × N; x is the number of i Representing the appearance characteristics of individuals numbered i in the video clip, wherein R represents a linear space in a real number domain, and D represents the dimension of the appearance characteristics of the linear space R;
a two-branch reasoning module: difference relation reasoning for performing residual error aggregation based on appearance characteristics to obtain difference characteristicsAnd performing similarity relationship reasoning on the appearance characteristics based on a graph neural network to obtain relationship characteristics X';
a weighted fusion module: the method is used for performing weighted fusion on the appearance characteristics, the difference characteristics and the relation characteristics in the channel direction to obtain weighted characteristics
A group activity prediction module: performing pooling operation on the weighted features to obtain a global representation representing the whole video segment, further processing to obtain confidence of each frame to the group activity category, and predicting the group activity category by using an average value of the confidence of each category of each frame;
when the dual-branch reasoning module works, the appearance characteristics are subjected to difference relation reasoning based on residual error aggregation to obtain a formula of difference characteristics, wherein the formula is as follows:
wherein j represents the number of an individual in a video clip, j =1,2, \8230;, T × N;representing the difference characteristic of the jth individual,is a difference featureIs a function of one of the elements of (a), r i (x j ) Representing the residual relationship between individual j and individual i,representing the spatial position correlation, x, between an individual j and an individual i j -x i Representing the difference of appearance characteristics between an individual j and an individual i in different sky;
the weighted fusion module performs the following steps,
s31, adding the elements of the appearance characteristic, the difference characteristic and the relation characteristic to obtain an integrated characteristic, wherein the calculation mode is as follows:
wherein F represents the integrated characteristic, and F is equal to R T×N×D B is the branch number; n is a radical of hydrogen b Representing the number of branches, N in the above formula b Is 3,X l Indicating appearance, difference, or offIs characterized in that the first and second liquid crystal panels,respectively represent different semantic features X,And X';
s32, embedding global information in the channel direction by using a global average pooling layer and a full connection layer to generate channel statistical information, wherein the calculation mode is as follows:
wherein S represents channel statistics, W s Representing learnable parameters for linear transformation of pooled features,f (n, t:) represents the characteristics of the nth individual in the t frame in the F in the channel direction;
s33, obtaining the weights of different branch characteristics in the channel direction through the operation of the full connection layer and the softmax, wherein the calculation mode is as follows:
W b =softmax(W b S);
wherein, W b Weight vector of features branched into b, w b For branch b the mathematically linear transformation parameters that map the channel statistics S to weight vectors,
s34, calculating each one-dimensional feature in the channel direction of the weighted and fused features in the following way:
wherein,representing weighted featuresThe value of the c-th channel of (c),representing the weight of the feature in the c channel dimension of the b-th branch,is W b The value of the c-th element;a value representing a characteristic in the channel dimension of the c-th branch of the b-th branch,is X b The value of the c-th element.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210236706.2A CN114863356B (en) | 2022-03-10 | 2022-03-10 | Group activity identification method and system based on residual aggregation graph network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210236706.2A CN114863356B (en) | 2022-03-10 | 2022-03-10 | Group activity identification method and system based on residual aggregation graph network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114863356A CN114863356A (en) | 2022-08-05 |
CN114863356B true CN114863356B (en) | 2023-02-03 |
Family
ID=82627853
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210236706.2A Active CN114863356B (en) | 2022-03-10 | 2022-03-10 | Group activity identification method and system based on residual aggregation graph network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114863356B (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111401174A (en) * | 2020-03-07 | 2020-07-10 | 北京工业大学 | Volleyball group behavior identification method based on multi-mode information fusion |
WO2021032295A1 (en) * | 2019-08-21 | 2021-02-25 | Toyota Motor Europe | System and method for detecting person activity in video |
CN112613349A (en) * | 2020-12-04 | 2021-04-06 | 北京理工大学 | Time sequence action detection method and device based on deep hybrid convolutional neural network |
CN112699786A (en) * | 2020-12-29 | 2021-04-23 | 华南理工大学 | Video behavior identification method and system based on space enhancement module |
CN112818843A (en) * | 2021-01-29 | 2021-05-18 | 山东大学 | Video behavior identification method and system based on channel attention guide time modeling |
CN112926396A (en) * | 2021-01-28 | 2021-06-08 | 杭州电子科技大学 | Action identification method based on double-current convolution attention |
CN113435430A (en) * | 2021-08-27 | 2021-09-24 | 中国科学院自动化研究所 | Video behavior identification method, system and equipment based on self-adaptive space-time entanglement |
CN113673489A (en) * | 2021-10-21 | 2021-11-19 | 之江实验室 | Video group behavior identification method based on cascade Transformer |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11210523B2 (en) * | 2020-02-06 | 2021-12-28 | Mitsubishi Electric Research Laboratories, Inc. | Scene-aware video dialog |
CN112434608B (en) * | 2020-11-24 | 2023-02-28 | 山东大学 | Human behavior identification method and system based on double-current combined network |
-
2022
- 2022-03-10 CN CN202210236706.2A patent/CN114863356B/en active Active
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2021032295A1 (en) * | 2019-08-21 | 2021-02-25 | Toyota Motor Europe | System and method for detecting person activity in video |
CN111401174A (en) * | 2020-03-07 | 2020-07-10 | 北京工业大学 | Volleyball group behavior identification method based on multi-mode information fusion |
CN112613349A (en) * | 2020-12-04 | 2021-04-06 | 北京理工大学 | Time sequence action detection method and device based on deep hybrid convolutional neural network |
CN112699786A (en) * | 2020-12-29 | 2021-04-23 | 华南理工大学 | Video behavior identification method and system based on space enhancement module |
CN112926396A (en) * | 2021-01-28 | 2021-06-08 | 杭州电子科技大学 | Action identification method based on double-current convolution attention |
CN112818843A (en) * | 2021-01-29 | 2021-05-18 | 山东大学 | Video behavior identification method and system based on channel attention guide time modeling |
CN113435430A (en) * | 2021-08-27 | 2021-09-24 | 中国科学院自动化研究所 | Video behavior identification method, system and equipment based on self-adaptive space-time entanglement |
CN113673489A (en) * | 2021-10-21 | 2021-11-19 | 之江实验室 | Video group behavior identification method based on cascade Transformer |
Non-Patent Citations (3)
Title |
---|
Coherence Constrained Graph LSTM;Jinhui Tang et.al;《IEEE》;20190715;636 - 647 * |
基于监控视频的群体行为识别算法研究;张乐军;《中国优秀硕士学位论文全文数据库信息科技辑》;20190715;全文 * |
基于行为特征优化和深度学习的行为识别研究;熊辛;《中国博士学位论文全文数据库信息科技辑》;20220215;45-87页 * |
Also Published As
Publication number | Publication date |
---|---|
CN114863356A (en) | 2022-08-05 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110084151B (en) | Video abnormal behavior discrimination method based on non-local network deep learning | |
CN109902564B (en) | Abnormal event detection method based on structural similarity sparse self-coding network | |
CN109767312B (en) | Credit evaluation model training and evaluation method and device | |
CN111737592B (en) | Recommendation method based on heterogeneous propagation collaborative knowledge sensing network | |
CN109522961B (en) | Semi-supervised image classification method based on dictionary deep learning | |
CN109344285A (en) | A kind of video map construction and method for digging, equipment towards monitoring | |
CN114692741B (en) | Generalized face counterfeiting detection method based on domain invariant features | |
CN111292195A (en) | Risk account identification method and device | |
CN112765370B (en) | Entity alignment method and device of knowledge graph, computer equipment and storage medium | |
Chi et al. | A decision support system for detecting serial crimes | |
CN112288034A (en) | Semi-supervised online anomaly detection method for wireless sensor network | |
CN111598032B (en) | Group behavior recognition method based on graph neural network | |
CN117591813B (en) | Complex equipment fault diagnosis method and system based on multidimensional features | |
CN118171171A (en) | Heat supply pipe network fault diagnosis method based on graphic neural network and multidimensional time sequence data | |
CN113343123A (en) | Training method and detection method for generating confrontation multiple relation graph network | |
CN114863356B (en) | Group activity identification method and system based on residual aggregation graph network | |
CN115761654B (en) | Vehicle re-identification method | |
CN116304906A (en) | Trusted graph neural network node classification method | |
CN116257786A (en) | Asynchronous time sequence classification method based on multi-element time sequence diagram structure | |
CN113312968A (en) | Real anomaly detection method in surveillance video | |
CN112927248A (en) | Point cloud segmentation method based on local feature enhancement and conditional random field | |
CN117828280B (en) | Intelligent fire information acquisition and management method based on Internet of things | |
Bai et al. | Neural ordinary differential equation model for evolutionary subspace clustering and its applications | |
CN118133190B (en) | Load identification model construction method and load identification method based on BN relation network | |
CN116543416A (en) | Unsupervised pedestrian re-identification method integrating relation features and content features |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |