CN114863356B

CN114863356B - Group activity identification method and system based on residual aggregation graph network

Info

Publication number: CN114863356B
Application number: CN202210236706.2A
Authority: CN
Inventors: 李威; 吴晓; 杨添朝; 张基
Original assignee: Southwest Jiaotong University
Current assignee: Southwest Jiaotong University
Priority date: 2022-03-10
Filing date: 2022-03-10
Publication date: 2023-02-03
Anticipated expiration: 2042-03-10
Also published as: CN114863356A

Abstract

The invention relates to the technical field of data identification, in particular to the technical field of intelligent video image analysis, and discloses a group activity identification method and a group activity identification system based on a residual aggregation graph network, wherein the method comprises the following steps: s1, extracting appearance characteristics; s2, double-branch reasoning; s3, weighting and fusing; and S4, predicting group activities. The invention solves the problems that video clips with similar individual actions but different group activity categories are difficult to effectively distinguish, importance screening on different semantic features is lacked and the like in the prior art.

Description

Group activity identification method and system based on residual aggregation graph network

Technical Field

The invention relates to the technical field of data identification, in particular to the technical field of intelligent video image analysis, and specifically relates to a group activity identification method and system based on a residual aggregation graph network.

Background

With the growing urban population and the dramatic increase in the flow of people in public areas, people monitoring and management face great challenges and pressures. If the video monitoring technology can be used for monitoring and alarming the abnormal group behavior phenomenon in the important area in time, relevant departments can take corresponding measures aiming at the early warning or alarming phenomenon in the shortest time to minimize the occurrence possibility of safety accidents and minimize the loss caused by the accidents. Therefore, more and more video monitoring systems are applied to public places to maintain public order and improve public area security, and group activity analysis is also receiving more and more attention.

The difficulty in group activity detection is that in addition to the need to master the actions of individuals, the potential connections between individuals in a group need to be mastered. Therefore, for better recognition of group activities, it is critical to utilize various information, such as appearance information, spatial location information, similarity relationship information, and difference information, among others.

For group activity recognition at present, the problem is mostly solved by a method based on a graph neural network. The method mainly comprises the following steps:

(1) and extracting appearance characteristics of each individual in a plurality of representative frames of the corresponding video clip through the basic network.

(2) And capturing the correlation among individuals in the group in a graph mode, and extracting the relation characteristics by graph convolution.

(3) And performing simple addition fusion and pooling operation on the individual appearance signs and the relationship characteristics to obtain video characteristics representing the whole video clip.

(4) And sending the video characteristics to a classifier to obtain a corresponding group activity classification result.

Such a method based on a graph neural network firstly ignores difference information (such as a small difference between close actions) existing between active persons in a video, which is very important for effectively distinguishing video segments with similar individual actions but different group activity categories; secondly, equal-weight fusion is adopted for the fusion of the low-order appearance characteristic and the high-order reasoning characteristic, and the fusion mode lacks of importance screening of different semantic characteristics.

Disclosure of Invention

In order to overcome the defects of the prior art, the invention provides a group activity identification method and system based on a residual aggregation graph network, and solves the problems that video segments with similar individual actions but different group activity types are difficult to effectively distinguish, importance screening on different semantic features is lacked, and the like in the prior art.

The technical scheme adopted by the invention for solving the problems is as follows:

a group activity identification method based on a residual aggregation graph network comprises the following steps:

s1, appearance feature extraction: obtaining the appearance characteristics X of the individual level of the group to be identified by utilizing the key frames of the given video clip and the corresponding bounding boxes of each individual,

x _i ∈R ^D wherein T represents the frame number of the key frame of the video clip, and N represents the individual number of each key frame in the video clip; i denotes the number of individuals in the video clip key-frame, i =1,2, \8230;, T × N; x is a radical of a fluorine atom _i Representing the appearance characteristics of an individual with the number i in a video clip, wherein R represents a linear space in a real number domain, and D represents the dimension of the appearance characteristics of the linear space R;

s2, double-branch reasoning: performing difference relation reasoning on the appearance characteristics based on residual error aggregation to obtain difference characteristics

And respectively carrying out similarity relation reasoning on the appearance characteristics based on a graph neural network to obtain relation characteristics X';

s3, weighted fusion: weighting and fusing the appearance characteristics, the difference characteristics and the relation characteristics in the channel direction to obtainWeighted features

S4, predicting group activities: and performing pooling operation on the weighted features to obtain a global representation representing the whole video segment, further processing to obtain the confidence of each frame to the group activity category, and predicting the group activity category by using the average value of the confidence of each category of each frame.

As a preferred technical solution, in step S2, performing difference relation inference based on residual aggregation on the appearance features to obtain a formula of difference features, where the formula is:

wherein j represents the number of an individual in a video clip, j =1,2, \8230;, T × N;

representing the difference characteristic of the jth individual,

is a difference feature

Is a function of one of the elements of (a),

r _i (x _j ) Representing the residual relationship between individual j and individual i,

representing the spatial position correlation, x, between an individual j and an individual i _j -x _i Representing the difference in appearance characteristics between individual j and individual i in different time slots.

As a preferred embodiment, r _i (x _j ) The calculation formula of (2) is as follows: :

wherein, w _j Representing a weight, b, mapping the apparent difference between two volumes centered on an individual j to a scalar _j Offset value w representing a scalar quantity that maps appearance differences between two volumes centered on an individual j _j ∈R ^D ，b _j ∈R ¹ 。

As a preferred technical solution, it is proposed that,

the calculation formula of (c) is:

wherein Π (·) represents an indicator function,

denotes the euclidean distance between the individual j and the individual i, and μ denotes a spatial restriction factor.

As a preferred technical solution, in step S2, the appearance features are respectively subjected to similarity relationship inference based on a graph neural network to obtain a formula of relationship features, where the formula is as follows:

wherein g represents the number of the relational diagram, N _g Representing the number of graphs, reLU () representing the ReLU activation function, G ^g Shows a relational graph with the number g, W _g A weight matrix which is linearly transformed corresponding to the relational graph with the number g is shown.

As a preferred embodiment, G ^g The calculation formula of (2) is as follows:

G ^g ＝{G _i，j ∈R ¹ |i，j＝1，...，T×N}，

wherein G is _i，j Representing the magnitude of the similarity relationship between the individual i and the individual j, f _a (x _i ，x _j ) Representing the apparent correlation between individual i and individual j.

As a preferred solution, f _a (x _i ，x _j ) The calculation formula of (2) is as follows:

wherein, theta (x) _i ) Appearance feature x representing the appearance of an individual i _i By D-dimension embedding

A linear transformation of the dimensional space is performed,

appearance feature x representing the individual j _j By D-dimension embedding

Linear transformation of dimensional space, tanspose () representing a transpose operation, d _k Indicating a normalization factor.

As a preferred technical solution, the step S3 includes the steps of,

s31, adding the elements of the appearance characteristic, the difference characteristic and the relation characteristic to obtain an integrated characteristic, wherein the calculation mode is as follows:

wherein F represents an integerCombined characteristics, F ∈ R ^T×N×D B is the branch number; n is a radical of _b Representing the number of branches, N in the above formula _b Is 3,X ⁱ Representing appearance features, difference features or relationship features,

respectively represent different semantic features X,

And X';

s32, embedding global information in the channel direction by using a global average pooling layer and a full connection layer to generate channel statistical information, wherein the calculation mode is as follows:

wherein S represents channel statistics, W _s Representing learnable parameters for linear transformation of pooled features,

f (n, t:) represents the characteristics of the nth individual of the t frame in F in the channel direction;

s33, obtaining the weights of different branch characteristics in the channel direction through the operation of the full connection layer and the softmax, wherein the calculation mode is as follows:

W _b ＝softmax(w _b S)；

wherein, W _b Weight vector of features branched into b, w _b For the learnable linear transformation parameters of branch b that map the channel statistics S to weight vectors,

s34, calculating each one-dimensional feature in the channel direction of the weighted and fused features in the following way:

wherein,

representing weighted features

The value of the c-th channel of (c),

representing the weight of the feature in the c-th channel dimension of the b-th branch,

is W _b The value of the c-th element;

a value representing a characteristic in the channel dimension of the c-th branch of the b-th branch,

is X ^b The value of the c-th element.

As a preferred technical solution, in step S4, the global representation is passed through a full connection layer and then through softmax operation, so as to obtain the confidence of each frame with respect to the group activity category.

The system applied to the group activity recognition method based on the residual aggregation map network comprises an appearance characteristic extraction module, a two-branch reasoning module, a weighted fusion module and a group activity prediction module which are electrically connected in sequence, wherein the appearance characteristic extraction module is also electrically connected with the weighted fusion module;

wherein;

appearance characteristic extraction module: the method is used for obtaining appearance characteristics X of the individual level of the group to be identified by utilizing key frames of the given video clip and the bounding box of each individual,

x _i ∈R ^D wherein T represents the frame number of the key frame of the video clip, and N represents the individual number of each key frame in the video clip; i denotes the number of individuals in the video clip key-frame, i =1,2, \8230;, T × N; x is the number of _i Representing the appearance characteristics of an individual with the number i in a video clip, wherein R represents a linear space in a real number domain, and D represents the dimension of the appearance characteristics of the linear space R;

a two-branch reasoning module: difference relation reasoning for performing residual error aggregation based on appearance characteristics to obtain difference characteristics

And performing similarity relationship reasoning on the appearance characteristics based on a graph neural network to obtain relationship characteristics X';

a weighted fusion module: the method is used for performing weighted fusion on the appearance characteristics, the difference characteristics and the relation characteristics in the channel direction to obtain weighted characteristics

A group activity prediction module: the method comprises the steps of performing a pooling operation on weighted features to obtain a global representation representing the whole video segment, then further processing to obtain the confidence of each frame to the group activity category, and predicting the group activity category by using the average value of the confidence of each category of each frame.

Compared with the prior art, the invention has the following beneficial effects:

(1) The method utilizes the difference information among the actors in the video, which is very important for effectively distinguishing the video segments which have similar individual actions but different group activity categories; the method for capturing the potential useful difference information in the group and the method for fusing different semantic features by self-adaptive weighting greatly improve the accuracy of group activity detection;

(2) The local residual error aggregation network module provided by the invention can encode potential differences among all related actors in a crowd and provide additional clues for reasoning;

(3) The weighted fusion strategy provided by the invention can adaptively select more important information in different semantic features.

Drawings

FIG. 1 is a schematic diagram illustrating steps of a group activity recognition method based on a residual aggregation graph network according to the present invention;

FIG. 2 is a block diagram of a group activity recognition system based on a residual aggregation graph network according to the present invention;

FIG. 3 is a schematic view of a workflow of a group activity recognition system based on a residual aggregation graph network according to the present invention;

FIG. 4 is a schematic diagram of a local residual aggregation network module according to the present invention;

FIG. 5 is a schematic structural diagram of a weighted fusion module according to the present invention;

FIG. 6 is a flow chart of model training according to the present invention.

Detailed Description

The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited to these examples.

Example 1

As shown in fig. 1 to 6, the present invention provides a group activity recognition method and a model training method, which can detect an action category of a group from a video.

The appearance information and the space-time information of the individuals are required to be extracted for detecting the actions of the individuals in the video, and each individual has own action behavior in a real scene, so that the identification of group activities is required to extract the appearance information and the space-time information of each individual and model the potential information among the individuals so as to deduce the potential relationship among the groups relative to the detection of the actions of the individuals.

Most population-based methods based on graph neural networks first ignore the difference information that exists between the active people in the video (such as the slight difference between close actions); secondly, equal-weight fusion is adopted for the fusion of the low-order appearance characteristic and the high-order relation characteristic, and the fusion mode lacks the importance screening of different semantic characteristics. The method for searching the mode capable of capturing the potential useful difference information in the group and the method for adaptively weighting and fusing different semantic features are very important for improving the accuracy of group activity detection.

In order to solve the above problems, the present invention provides a group activity identification method and system based on a residual aggregation graph network. The local residual error aggregation network module is used for capturing potential useful difference information in a population, and can be combined with the relation inference network module based on the graph neural network to form double-branch inference. The weighting fusion module is used for adaptively weighting and fusing different semantic features to achieve the effect of screening important information.

The main technical solution of the present invention will be described in detail with reference to specific examples.

1. The group activity detection method based on the graph neural network and the residual error aggregation network comprises the following steps:

1. extracting basic appearance characteristics:

the key frames of the given video clip and the corresponding bounding boxes of each individual are subjected to individual-level appearance characteristics by using a backbone network and Roialign, and the appearance characteristics are expressed as

Wherein x is _i Representing the appearance characteristics, x, of each individual _i ∈R ^D 。

2. The two-branch reasoning network module:

and inputting the obtained appearance characteristic X into a local residual error aggregation network module and a relation inference network module based on a graph neural network respectively to carry out double-branch high-order inference. These two networks will be described separately below.

(1) Local residual aggregation network module (LR) ² M)：

Inputting the obtained appearance characteristic into LR ² M, modeling the difference information between individuals in the group to obtain the difference characteristics

(wherein

Is a difference characteristic

Is a function of one of the elements of (1),

representing the difference signature of the jth individual).

The calculation method of (c) is as follows:

wherein x is _j -x _i Representing the difference in appearance between the jth individual and the ith individual; r is a radical of hydrogen _i (x _j ) Representing the residual relationship between an individual j and an individual i, when there is a useful difference relationship between (j, i), r _i (x _j ) =1, when there is no useful difference relation r between (j, i) _i (x _j )＝0；

Is a space limitation.

To make the above equation differentiable so as to be able to propagate backwards during network training, we will let r _i (x _j ) Performing smoothing on r _i (x _j ) The calculation is shown below:

wherein, w _j Representing a weight, b, that maps the apparent difference between two volumes centered on an individual j as a scalar _j Offset value, w, representing a scalar quantity that maps appearance differences between two volumes centered on an individual j _j ∈R ^D ，b _j ∈R ¹ 。

Previous research experiments have demonstrated that local information is more conducive to group activity category reasoning. Therefore, the local residual relation is aggregated by using a distance mask, and the calculation method is as follows:

wherein pi (·) is an indication function, μ is a space limiting factor, and is a hyper-parameter, and the specific value is determined according to the situation;

is the euclidean distance between the individual j and the individual i.

When in use

Finally, the product is processed

The calculation method of (c) is as follows:

group activities are inferred by modeling local differences between the active people in the group to obtain useful difference information in the video clip.

(2) The relation reasoning module based on the neural network of the graph:

for relational modeling, a graph neural network is adopted to establish an actor-relational graph, and relational information between actors can be provided for group activity recognition by utilizing a graph structure.

Each node in the actor-relationship graph represents each individual, while the importance of the relationship between two individuals is represented by the edge weights between the two actors. The weight of each edge in the graph is determined by the appearance characteristics and the spatial position of the individuals at the two ends. The method for calculating the weight of the edge between two body nodes in the method comprises the following steps:

wherein, G _i，j Representing the magnitude of the similarity relationship between the individual i and the individual j, f _a (x _i ，x _j ) Representing the apparent correlation between individual i and individual j.

For the appearance characteristic correlation, an embedded dot product method is adopted, and the corresponding formula is as follows:

A linear transformation of the dimensional space is performed,

appearance feature x representing the individual j _j By D-dimension embedding

Linear transformation of the dimensional space, tanspose () representing a transpose operation, d _k Represents a normalization factor, d _k Is a constant.

For spatial position correlation, the same distance masking approach as the above local residual aggregation network is adopted:

wherein Π (·) represents an indicator function,

denotes the euclidean distance between individual j and individual i and μ denotes the spatial limiting factor.

Thus, for a graph of relationships between two actors in a population, it can be expressed as:

G ^g ＝{G _i，j ∈R ¹ |i，j＝1，...，T×W}；

and meanwhile, a plurality of relation graphs are constructed to capture different related information. In the present invention, a series of graphs are built

Each graph is computed separately and does not share weights. The multiple graphs can be established, and the model can be operated to merge and learn different types of relationship information, so that the model can make more reliable relationship reasoning.

After graph construction, single-layer relationship reasoning is implemented using GCN. For the relationship features, they are calculated as follows:

Difference characteristics are respectively obtained through the local residual error aggregation network module and the relation reasoning module based on the graph neural network

And a relation feature X' to form a two-branch reasoning network.

3. Weighted fusion module (WAS):

the appearance characteristics X and the differenceFeature(s)

And inputting the relation characteristic X' into the self-adaptive weighting fusion module, and performing weighting fusion on the three different semantic characteristics in the channel direction. The specific method comprises the following steps:

(1) firstly, integrating information of all branch characteristics, and obtaining the integrated characteristics through simple addition in element aspects, wherein the calculation mode is as follows:

wherein F represents the integrated characteristic, and F is epsilon of R ^T×N×D B is the branch number; n is a radical of _b Representing the number of branches, N in the above formula _b Is 3,X ⁱ Representing appearance features, difference features or relationship features,

respectively represent different semantic features X,

And X'.

(2) The global information is simply embedded in the channel direction using the global average pooling and full-link layers to generate channel statistics, calculated as follows:

f (n, t:) represents the characteristics of the nth individual of the t-th frame in F in the channel direction.

(3) The weights of the different branch features in the channel direction are obtained by simple full connectivity and softmax operations. The calculation method is as follows:

W _b ＝softmax(w _b S)；

(4) finally, the calculation mode of each one-dimensional feature in the channel direction of the weighted and fused feature is as follows:

wherein,

representing weighted features

The value of the c-th channel of (c),

is W _b The value of the c-th element;

is X ^b The value of the c-th element.

Through the WAS, the features under different semantics are subjected to adaptive weighted fusion, and information more important for group activity identification is screened out.

4. By applying the above-obtained weighting characteristics

A pooling operation is performed resulting in a global representation representing the entire video segment. And (4) obtaining the confidence of each group activity category of each frame by a simple full connection layer and then by softmax operation through the global representation. And predicting the group activity category by using the average value of the confidence degrees of the categories of each frame.

2. The model training method for group detection based on the graph neural network and the residual error aggregation network comprises the following steps:

1. and acquiring a video clip sample and a label corresponding to the sample, wherein the label represents the group activity of each key frame in the training sample.

2. Dividing the sample and the label thereof into two parts according to a proportion, wherein one part is a training set and is used for training the model; and a part is a verification set used for selecting the model.

3. And pre-training the backbone network by using the processed training set.

4. And (3) processing the samples in the training set, outputting a prediction result through a model, and calculating the loss of the prediction result and the real label by using cross loss entropy.

5. And training the model through back propagation and parameter updating, and performing reasoning test by using a verification set so as to select a better model result.

Example 2

As shown in fig. 1 to fig. 6, as a further optimization of the embodiment 1, this embodiment includes all the technical features of the embodiment 1, and in addition, this embodiment also includes the following technical features:

firstly, inputting a key frame of a given video clip and a corresponding bounding box of each individual into a backbone network by utilizing the backbone network and Roialign to obtain the appearance characteristics of the individual; secondly, inputting the appearance characteristics into a local residual error aggregation network module and a relation reasoning module based on a graph neural network to obtain corresponding difference characteristics and relation characteristics; then, performing self-adaptive weighted fusion on the appearance characteristic, the difference characteristic and the relation characteristic through a weighted fusion module to obtain a fused characteristic; and finally, performing pooling operation on the fused features to obtain video global representation, and inputting the video global representation into a classifier to obtain a final classification result.

In the local residual error aggregation network module, the appearance characteristics of each individual are input into the local residual error aggregation module, residual errors between every two individuals and residual error correlation coefficients corresponding to the residual errors are respectively calculated, and finally the difference characteristics of each individual are calculated under the constraint of spatial positions.

In the task of group action recognition, we put the proposed solution on two reference datasets: the Volleyball dataset and the Collective Activity dataset were compared to the most advanced methods. And we used two indices to evaluate model accuracy, MCA (multi-class classification accuracy) and MPCA (average per class accuracy), respectively.

For the Volleyball data set, the MCA is improved by 2.6% on the Volleyball data set through a proposed two-branch inference mode formed by a local residual error network module and a graph neural network-based relation inference module and a proposed weighting fusion module on the basis of ARG. Compared with the advanced method DIN based on the neural network of the graph, the method provided by the invention respectively improves MCA and MPCA by 0.9% and 1.2% under the condition of taking VGG16 as the backbone.

For the Collective Activity dataset, our proposed method improved MCA by 5.1% with VGG16 as backbone and ARG as baseline. Compared with advanced methods DIN based on a graph neural network, the method respectively takes RestNet18 and VGG16 as backbones, and the method is respectively improved by 0.8 percent and 0.6 percent in MPCA

In this method we take 3 representative frames from each video segment as input and clip each frame of the Volleyball dataset to a size of 720 × 1280 and each frame of the Collective Activity dataset to a size of 480 × 720. The method adopts RestNet18 or VGG16 as a backbone network. Adaptive momentum estimation (Adam) optimizerIn a training model, wherein ₁ ＝0.9，β ₂ ＝0.999，∈＝10 ^-8 . For the Volleyball dataset, the initial learning rate is set to 1e ^-4 The learning rate is updated every 10 epochs at a decay rate of 0.3, with a training period of 40. For the Collective Activity dataset, the RestNet18 or VGG16 learning rates are set to 4e for ^-5 And 1e ^-4 The training period is 30. The spatial limiting factor of the local residual aggregation network module is set to a graph width of 0.2 and 0.3 in the Volleyball data set and the Collective Activity data set, respectively. Canonical factor d of relational inference module based on graph neural network _k Set to 256. The dimension shrinkage factor of the weighted fusion module is set to 16. The batch size of both datasets was set to 2 at training.

The accuracy on these two reference data sets reflects the advancement of our method. For analysis reasons, the method can have two main advantages: (1) The local residual error aggregation network module provided by the method can encode potential differences among all related actors in the crowd and provide additional clues for reasoning; (2) The weighting fusion strategy provided by the method can adaptively select more important information in different semantic characteristic values.

As described above, the present invention can be preferably realized.

All features disclosed in all embodiments in this specification, or all methods or process steps implicitly disclosed, may be combined and/or expanded, or substituted, in any way, except for mutually exclusive features and/or steps.

The foregoing is only a preferred embodiment of the present invention, and the present invention is not limited thereto in any way, and any simple modification, equivalent replacement and improvement made to the above embodiment within the spirit and principle of the present invention still fall within the protection scope of the present invention.

Claims

1. A group activity identification method based on a residual aggregation graph network is characterized by comprising the following steps:

s1, appearance feature extraction: obtaining the appearance characteristics X of the individual level of the group to be identified by utilizing the key frames of the given video clip and the bounding boxes of each individual,

x _i ∈R ^D wherein T represents the frame number of the key frame of the video clip, and N represents the individual number of each key frame in the video clip; i represents the number of individuals in the key frame of the video segment, i =1,2, \8230;, T × N; x is a radical of a fluorine atom _i Representing the appearance characteristics of individuals numbered i in the video clip, wherein R represents a linear space in a real number domain, and D represents the dimension of the appearance characteristics of the linear space R;

s3, weighted fusion: weighting and fusing the appearance characteristics, the difference characteristics and the relation characteristics in the channel direction to obtain weighted characteristics

S4, predicting group activities: pooling the weighted features to obtain a global representation representing the whole video segment, further processing to obtain a confidence coefficient of each frame for the group activity category, and predicting the group activity category by using an average value of the confidence coefficients of each category of each frame;

in step S2, the appearance features are subjected to difference relationship inference based on residual aggregation to obtain a difference feature formula as follows:

representing the difference characteristic of the jth individual,

is a difference characteristic

Is a function of one of the elements of (1),

representing the spatial position correlation, x, between an individual j and an individual i _j -x _i Representing the difference in appearance characteristics between an individual j and an individual i in different sky;

the step S3 comprises the following steps,

wherein F represents the integrated characteristic, and F is epsilon of R ^T×N×D B is the branch number; n is a radical of _b Representing the number of branches, N in the above formula _b Is 3,X ^l Representing appearance features, difference features or relationship features,

respectively represent different semantic features X,

And X';

where s represents channel statistics, W _s Representing learnable parameters for linear transformation of pooled features,

f (n, t:) represents the characteristics of the nth individual in the t frame in the F in the channel direction;

W _b ＝softmax(w _b S)；

wherein, W _b Weight vector of features branched into b, w _b For branch b the mathematically linear transformation parameters that map the channel statistics S to weight vectors,

wherein,

representing weighted features

The value of the c-th channel of (c),

representing the weight of the feature in the c channel dimension of the b-th branch,

is W _b The value of the c-th element;

is X ^b The value of the c-th element.

2. The method of claim 1, wherein r is the group activity recognition method based on the residual aggregation graph network _i (x _j ) The calculation formula of (2) is as follows:

wherein, w _j Representing a weight, b, that maps the apparent difference between two volumes centered on an individual j as a scalar _j Offset value, w, representing a scalar quantity that maps the apparent difference between two volumes centered on an individual j _j ∈R ^D ，b _j ∈R ¹ 。

3. The method according to claim 2, wherein the group activity recognition method based on the residual aggregation graph network,

the calculation formula of (c) is:

wherein pi (·) represents an indicator function,

4. The group activity recognition method based on the residual aggregation graph network as claimed in claim 1, wherein in step S2, the appearance features are respectively subjected to graph neural network-based similarity relationship inference to obtain a relationship feature formula as follows:

5. The method of claim 4, wherein G is a group activity recognition method based on the residual aggregation graph network ^g The calculation formula of (2) is as follows:

G ^g ＝{G _i，j ∈R ¹ |i，j＝1，...，T×N}，

6. The method according to claim 5, wherein f is a group activity recognition method based on a residual aggregation graph network _a (x _i ，x _j ) The calculation formula of (2) is as follows:

wherein, theta (x) _i ) Appearance characteristic x representing the individual i _i By D-dimension embedding

A linear transformation of the dimensional space is performed,

appearance feature x representing the individual j _j By D-dimension embedding

Linear transformation of dimensional space, tansp. se () represents a transpose operation, d _k Indicating a normalization factor.

7. The method of claim 6, wherein in step S4, the global representation is passed through a full link layer and then through softmax operation, so as to obtain the confidence of each frame with respect to the group activity category.

8. The system applied to the group activity identification method based on the residual aggregation map network is characterized by comprising an appearance feature extraction module, a double-branch reasoning module, a weighted fusion module and a group activity prediction module which are electrically connected in sequence, wherein the appearance feature extraction module is also electrically connected with the weighted fusion module;

wherein;

appearance feature extraction module: the method is used for obtaining appearance characteristics X of the individual level of the group to be identified by utilizing key frames of the given video clip and the bounding box of each individual,

x _i ∈R ^D wherein, T represents the frame number of the key frame of the video clip, and N represents the number of each key frame in the video clip; i represents the number of individuals in the key frame of the video segment, i =1,2, \8230;, T × N; x is the number of _i Representing the appearance characteristics of individuals numbered i in the video clip, wherein R represents a linear space in a real number domain, and D represents the dimension of the appearance characteristics of the linear space R;

A group activity prediction module: performing pooling operation on the weighted features to obtain a global representation representing the whole video segment, further processing to obtain confidence of each frame to the group activity category, and predicting the group activity category by using an average value of the confidence of each category of each frame;

when the dual-branch reasoning module works, the appearance characteristics are subjected to difference relation reasoning based on residual error aggregation to obtain a formula of difference characteristics, wherein the formula is as follows:

representing the difference characteristic of the jth individual,

is a difference feature

Is a function of one of the elements of (a),

representing the spatial position correlation, x, between an individual j and an individual i _j -x _i Representing the difference of appearance characteristics between an individual j and an individual i in different sky;

the weighted fusion module performs the following steps,

wherein F represents the integrated characteristic, and F is equal to R ^T×N×D B is the branch number; n is a radical of hydrogen _b Representing the number of branches, N in the above formula _b Is 3,X ^l Indicating appearance, difference, or offIs characterized in that the first and second liquid crystal panels,