CN114863356A - Group activity identification method and system based on residual aggregation graph network - Google Patents

Group activity identification method and system based on residual aggregation graph network

Info

Publication number
CN114863356A
CN114863356A
Authority
CN
China
Prior art keywords
individual
representing
appearance
group activity
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210236706.2A
Other languages
Chinese (zh)
Other versions
CN114863356B (en)
Inventor
李威
吴晓
杨添朝
张基
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southwest Jiaotong University
Original Assignee
Southwest Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southwest Jiaotong University filed Critical Southwest Jiaotong University
Priority to CN202210236706.2A priority Critical patent/CN114863356B/en
Publication of CN114863356A publication Critical patent/CN114863356A/en
Application granted granted Critical
Publication of CN114863356B publication Critical patent/CN114863356B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/24 - Classification techniques
    • G06F 18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 - Classification techniques relating to the classification model based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F 18/25 - Fusion techniques
    • G06F 18/253 - Fusion techniques of extracted features
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G06N 3/047 - Probabilistic or stochastic networks
    • G06N 3/048 - Activation functions
    • G06N 3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the technical field of data identification, in particular to the technical field of intelligent video image analysis, and discloses a group activity identification method and system based on a residual aggregation graph network. The method comprises the following steps: S1, extracting appearance features; S2, performing two-branch reasoning; S3, weighted fusion; and S4, predicting the group activity. The invention solves problems of the prior art, such as the difficulty of effectively distinguishing video clips with similar individual actions but different group activity categories and the lack of importance screening of different semantic features.

Description

Group activity identification method and system based on residual aggregation graph network
Technical Field
The invention relates to the technical field of data identification, in particular to the technical field of intelligent video image analysis, and specifically relates to a group activity identification method and system based on a residual aggregation graph network.
Background
With the growing urban population and the dramatic increase in the flow of people in public areas, crowd monitoring and management face great challenges and pressures. If video surveillance technology can be used to detect and raise alarms for abnormal group behavior in important areas in time, the relevant departments can take corresponding measures against the early warnings or alarms in the shortest possible time, minimizing both the likelihood of safety accidents and the losses such accidents cause. Therefore, more and more video surveillance systems are deployed in public places to maintain public order and improve security in public areas, and group activity analysis is receiving more and more attention.
The difficulty of group activity detection is that, in addition to the actions of individuals, the potential connections between individuals in a group must be captured. Therefore, to better recognize group activities, it is critical to utilize various kinds of information, such as appearance information, spatial location information, similarity relationship information, and difference information.
At present, group activity recognition is mostly addressed by methods based on graph neural networks. Such methods mainly comprise the following steps:
Firstly, extracting the appearance features of each individual in several representative frames of the corresponding video clip through a backbone network.
Secondly, capturing the correlations among individuals in the group in the form of a graph, and extracting relation features by graph convolution.
Thirdly, performing simple element-wise addition fusion and a pooling operation on the individual appearance features and the relation features to obtain a video feature representing the whole video clip.
Fourthly, feeding the video feature into a classifier to obtain the corresponding group activity classification result.
Such graph-neural-network-based methods, firstly, ignore the difference information that exists between the actors in a video (such as the slight differences between similar actions), which is very important for effectively distinguishing video clips with similar individual actions but different group activity categories; secondly, they fuse the low-order appearance features and the high-order reasoning features with equal weights, a fusion scheme that lacks importance screening of the different semantic features.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a group activity identification method and system based on a residual aggregation graph network, which solve the problems in the prior art that video clips with similar individual actions but different group activity categories are difficult to distinguish effectively and that importance screening of different semantic features is lacking.
The technical scheme adopted by the invention to solve the above problems is as follows:
A group activity identification method based on a residual aggregation graph network comprises the following steps:
S1, extracting appearance features: obtaining the individual-level appearance features X of the group to be identified by utilizing the key frames of a given video clip and the corresponding bounding box of each individual, X = {x_i | i = 1, 2, ..., T×N}, x_i ∈ R^D, wherein T represents the number of key frames of the video clip, N represents the number of individuals in each key frame, i represents the number of an individual in the video clip key frames, x_i represents the appearance feature of the individual numbered i in the video clip, R represents a linear space over the real number field, and D represents the dimension of the appearance features in the linear space R;
S2, two-branch reasoning: performing difference relation reasoning based on residual aggregation on the appearance features to obtain the difference features X̂, and performing similarity relation reasoning based on a graph neural network on the appearance features to obtain the relation features X';
S3, weighted fusion: performing weighted fusion of the appearance features, the difference features and the relation features in the channel direction to obtain the weighted features X̃;
S4, group activity prediction: performing a pooling operation on the weighted features to obtain a global representation of the whole video clip, further processing it to obtain the confidence of each frame for each group activity category, and predicting the group activity category using the average of the per-category confidences over all frames.
As a preferable technical solution, in step S2, the formula for performing difference relation reasoning based on residual aggregation on the appearance features to obtain the difference features is as follows:
x̂_j = Σ_{i=1}^{T×N} r_i(x_j) · m_{j,i} · (x_j − x_i),
where j denotes the number of an individual in the video clip, j = 1, 2, ..., T×N; x̂_j represents the difference feature of the j-th individual and is an element of the difference features X̂ = {x̂_j | j = 1, 2, ..., T×N}; r_i(x_j) represents the residual relationship between individual j and individual i; m_{j,i} represents the spatial position correlation between individual j and individual i; and x_j − x_i represents the difference in appearance features between individual j and individual i (possibly located in different frames).
As a preferred embodiment, the calculation formula of r_i(x_j) is:
r_i(x_j) = σ(w_j^T (x_j − x_i) + b_j),
where σ(·) denotes the sigmoid function, w_j represents the weight that maps the appearance difference between two individuals centered on individual j to a scalar, b_j represents the corresponding scalar offset, w_j ∈ R^D, b_j ∈ R^1.
As a preferred technical solution, the calculation formula of the spatial position correlation m_{j,i} is:
m_{j,i} = Π(d_{j,i} ≤ μ),
where Π(·) represents the indicator function, d_{j,i} denotes the Euclidean distance between individual j and individual i, and μ denotes the spatial limiting factor.
As a preferable technical solution, in step S2, the formula for performing similarity relation reasoning based on a graph neural network on the appearance features to obtain the relation features is:
X' = Σ_{g=1}^{N_g} ReLU(G_g X W_g),
where g represents the number of a relation graph, N_g represents the number of graphs, ReLU(·) represents the ReLU activation function, G_g represents the relation graph numbered g, and W_g represents the weight matrix of the linear transformation corresponding to the relation graph numbered g.
As a preferred embodiment, the calculation formula of G_g is:
G_g = {G_{i,j} ∈ R^1 | i, j = 1, ..., T×N},
G_{i,j} = Π(d_{i,j} ≤ μ) · exp(f_a(x_i, x_j)) / Σ_{j=1}^{T×N} ( Π(d_{i,j} ≤ μ) · exp(f_a(x_i, x_j)) ),
where G_{i,j} represents the magnitude of the similarity relationship between individual i and individual j, and f_a(x_i, x_j) represents the appearance correlation between individual i and individual j.
As a preferred solution, the calculation formula of f_a(x_i, x_j) is:
f_a(x_i, x_j) = θ(x_i) · Transpose(φ(x_j)) / √d_k,
where θ(x_i) represents the linear transformation that embeds the D-dimensional appearance feature x_i of individual i into a d_k-dimensional space, φ(x_j) represents the linear transformation that embeds the D-dimensional appearance feature x_j of individual j into a d_k-dimensional space, Transpose(·) represents the transpose operation, and d_k denotes the normalization factor.
As a preferable technical solution, step S3 comprises the following steps:
S31, adding the appearance features, the difference features and the relation features element-wise to obtain the integrated features, calculated as:
F = Σ_{b=1}^{N_b} X_b,
where F represents the integrated features, F ∈ R^{T×N×D}; b is the branch number; N_b represents the number of branches, which in the above formula is 3; X_b represents the appearance features, the difference features or the relation features, i.e. X_1, X_2 and X_3 denote the different semantic features X, X̂ and X', respectively;
S32, embedding global information in the channel direction by using global average pooling and a fully connected layer to generate the channel statistics, calculated as:
S = W_s · ( (1/(T×N)) Σ_{t=1}^{T} Σ_{n=1}^{N} F(n, t, :) ),
where S represents the channel statistics, W_s represents the learnable parameters that linearly transform the pooled features, and F(n, t, :) represents the feature of the n-th individual in the t-th frame of F along the channel direction;
S33, obtaining the weights of the different branch features in the channel direction through a fully connected layer and a softmax operation, calculated as:
W_b = softmax(w_b S),
where W_b is the weight vector of the features of branch b, and w_b is the learnable linear transformation parameter of branch b that maps the channel statistics S to the weight vector;
S34, calculating each one-dimensional feature in the channel direction of the weighted fused features as:
X̃^c = Σ_{b=1}^{N_b} W_b^c · X_b^c,
where X̃^c represents the value of the c-th channel of the weighted features X̃; W_b^c represents the weight of the b-th branch feature in the c-th channel dimension, i.e. the value of the c-th element of W_b; and X_b^c represents the value of the b-th branch feature in the c-th channel dimension, i.e. the value of the c-th element of X_b.
As a preferred technical solution, in step S4, the global representation is passed through a full connection layer and then through softmax operation, so as to obtain the confidence of each frame with respect to the group activity category.
The system applied to the group activity identification method based on a residual aggregation graph network comprises an appearance feature extraction module, a two-branch reasoning module, a weighted fusion module and a group activity prediction module which are electrically connected in sequence, wherein the appearance feature extraction module is also electrically connected with the weighted fusion module;
wherein:
appearance feature extraction module: used for obtaining the individual-level appearance features X of the group to be identified by utilizing the key frames of a given video clip and the bounding box of each individual, X = {x_i | i = 1, 2, ..., T×N}, x_i ∈ R^D, wherein T represents the number of key frames of the video clip, N represents the number of individuals in each key frame, i represents the number of an individual in the video clip key frames, x_i represents the appearance feature of the individual numbered i in the video clip, R represents a linear space over the real number field, and D represents the dimension of the appearance features in the linear space R;
two-branch reasoning module: used for performing difference relation reasoning based on residual aggregation on the appearance features to obtain the difference features X̂, and performing similarity relation reasoning based on a graph neural network on the appearance features to obtain the relation features X';
weighted fusion module: used for performing weighted fusion of the appearance features, the difference features and the relation features in the channel direction to obtain the weighted features X̃;
group activity prediction module: used for performing a pooling operation on the weighted features to obtain a global representation of the whole video clip, further processing it to obtain the confidence of each frame for each group activity category, and predicting the group activity category using the average of the per-category confidences over all frames.
Compared with the prior art, the invention has the following beneficial effects:
(1) the method utilizes the difference information among the actors in the video, which is very important for effectively distinguishing the video segments which have similar individual actions but different group activity categories; by means of a method capable of capturing potential useful difference information in a group and a self-adaptive method for weighting and fusing different semantic features, the group activity detection precision is greatly improved;
(2) the local residual error aggregation network module provided by the invention can encode potential differences among all related actors in a crowd and provide additional clues for reasoning;
(3) the weighted fusion strategy provided by the invention can adaptively select more important information in different semantic features.
Drawings
FIG. 1 is a schematic diagram illustrating steps of a group activity recognition method based on a residual aggregation graph network according to the present invention;
FIG. 2 is a block diagram of a group activity recognition system based on a residual aggregation graph network according to the present invention;
FIG. 3 is a schematic view of a workflow of a group activity recognition system based on a residual aggregation graph network according to the present invention;
FIG. 4 is a schematic structural diagram of a local residual aggregation network module according to the present invention;
FIG. 5 is a schematic structural diagram of a weighted fusion module according to the present invention;
FIG. 6 is a flow chart of model training according to the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited to these examples.
Example 1
As shown in fig. 1 to 6, the present invention provides a group activity recognition method and a model training method, which can detect an action category of a group from a video.
Detecting the actions of an individual in a video requires extracting the individual's appearance information and spatio-temporal information. In a real scene, each individual has its own motion behavior, so group activity recognition is performed by extracting the appearance and spatio-temporal information of each individual and modeling the latent information among individuals so as to infer the potential relationships within the group.
Most group activity recognition methods based on graph neural networks, firstly, ignore the difference information that exists between the actors in a video (such as the slight differences between similar actions); secondly, they fuse the low-order appearance features and the high-order relation features with equal weights, a fusion scheme that lacks importance screening of the different semantic features. Finding a way to capture the potentially useful difference information within a group, together with a method for adaptively weighting and fusing different semantic features, is therefore very important for improving the accuracy of group activity detection.
In order to solve the above problems, the present invention provides a group activity identification method and system based on a residual aggregation graph network. The local residual aggregation network module is used to capture the potentially useful difference information within a group, and can be combined with the graph-neural-network-based relation reasoning module to form two-branch reasoning. The weighted fusion module is used to adaptively weight and fuse different semantic features so as to screen out the important information.
The main technical solution of the present invention will be described in detail with reference to specific examples.
A group activity detection method based on a graph neural network and a residual error aggregation network comprises the following steps:
1. Extracting basic appearance features:
The key frames of a given video clip and the corresponding bounding box of each individual are processed by a backbone network and RoIAlign to obtain individual-level appearance features, expressed as X = {x_i | i = 1, 2, ..., T×N}, where x_i represents the appearance feature of each individual and x_i ∈ R^D.
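For illustration only, the following PyTorch-style sketch shows one way this appearance-feature extraction step could be realized with a ResNet backbone and torchvision's RoIAlign; the backbone choice, crop size, feature dimension and box values are placeholder assumptions, not the exact configuration of the invention.

```python
import torch
import torch.nn as nn
import torchvision
from torchvision.ops import roi_align

# Hypothetical sizes: T key frames, N individuals per frame, D-dim appearance features.
T, N, D = 3, 12, 1024

backbone = torchvision.models.resnet18(weights=None)
feature_extractor = nn.Sequential(*list(backbone.children())[:-2])  # keep convolutional stages only

frames = torch.randn(T, 3, 720, 1280)  # T key frames of one video clip
# One (x1, y1, x2, y2) bounding box per individual and frame (placeholder values).
boxes = [torch.tensor([[10.0, 10.0, 110.0, 210.0]]).repeat(N, 1) for _ in range(T)]

feat_maps = feature_extractor(frames)                       # (T, C, H/32, W/32) feature maps
pooled = roi_align(feat_maps, boxes, output_size=(5, 5),
                   spatial_scale=feat_maps.shape[-1] / frames.shape[-1])
proj = nn.Linear(pooled.shape[1] * 5 * 5, D)                # project each pooled crop to D dims
X = proj(pooled.flatten(1))                                  # (T*N, D) individual appearance features x_i
print(X.shape)                                               # torch.Size([36, 1024])
```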
2. The two-branch reasoning network module:
The obtained appearance features X are input into the local residual aggregation network module and the graph-neural-network-based relation reasoning module, respectively, to carry out two-branch high-order reasoning. These two networks are described separately below.
Local residual aggregation network module (LR²M):
The obtained appearance features are input into LR²M, and the difference information among individuals in the group is modeled to obtain the difference features X̂ = {x̂_j | j = 1, 2, ..., T×N}, where x̂_j is an element of the difference features X̂ and represents the difference feature of the j-th individual.
x̂_j is calculated as follows:
x̂_j = Σ_{i=1}^{T×N} r_i(x_j) · m_{j,i} · (x_j − x_i),
wherein x_j − x_i represents the appearance difference between the j-th individual and the i-th individual; r_i(x_j) represents the residual relationship between individual j and individual i, taking the value 1 when a useful difference relationship exists between (j, i) and 0 when no useful difference relationship exists; and m_{j,i} is a spatial limitation.
To make the above equation differentiable so that it can be back-propagated during network training, r_i(x_j) is smoothed. The smoothed r_i(x_j) is calculated as follows:
r_i(x_j) = σ(w_j^T (x_j − x_i) + b_j),
where σ(·) denotes the sigmoid function, w_j represents the weight that maps the appearance difference between two individuals centered on individual j to a scalar, b_j represents the corresponding scalar offset, w_j ∈ R^D, b_j ∈ R^1.
Previous research has shown that local information is more favorable for reasoning about group activity categories. Therefore, the local residual relations are aggregated using a distance mask, calculated as follows:
m_{j,i} = Π(d_{j,i} ≤ μ),
where Π(·) is the indicator function, μ is the spatial limiting factor (a hyper-parameter whose specific value is chosen according to the situation), and d_{j,i} is the Euclidean distance between individual j and individual i.
That is, m_{j,i} = 1 when d_{j,i} ≤ μ, and m_{j,i} = 0 otherwise.
Finally, x̂_j is calculated as follows:
x̂_j = Σ_{i=1}^{T×N} r_i(x_j) · m_{j,i} · (x_j − x_i).
group activities are inferred by modeling local differences between the activities in the group as described above to obtain useful difference information in the video segments.
A relation reasoning module based on the neural network of the graph:
for relational modeling, a graph neural network is adopted to establish an actor-relational graph, and relational information between actors can be provided for group activity recognition by utilizing a graph structure.
Each node in the actor relation graph represents an individual, and the importance of the relationship between two individuals is represented by the weight of the edge between the two actors. The weight of each edge in the graph is determined by the appearance features and the spatial positions of the individuals at its two ends. The weight of the edge between two individual nodes is calculated as:
G_{i,j} = Π(d_{i,j} ≤ μ) · exp(f_a(x_i, x_j)) / Σ_{j=1}^{T×N} ( Π(d_{i,j} ≤ μ) · exp(f_a(x_i, x_j)) ),
where G_{i,j} represents the magnitude of the similarity relationship between individual i and individual j, and f_a(x_i, x_j) represents the appearance correlation between individual i and individual j.
For the appearance correlation, an embedded dot-product method is adopted, with the corresponding formula:
f_a(x_i, x_j) = θ(x_i) · Transpose(φ(x_j)) / √d_k,
where θ(x_i) represents the linear transformation that embeds the D-dimensional appearance feature x_i of individual i into a d_k-dimensional space, φ(x_j) represents the linear transformation that embeds the D-dimensional appearance feature x_j of individual j into a d_k-dimensional space, Transpose(·) represents the transpose operation, and d_k denotes the normalization factor, which is a constant.
For the spatial position correlation, the same distance-masking method as in the local residual aggregation network is adopted:
Π(d_{i,j} ≤ μ),
where Π(·) represents the indicator function, d_{i,j} denotes the Euclidean distance between individual i and individual j, and μ denotes the spatial limiting factor.
Thus, a relation graph between the actors in the group can be expressed as:
G_g = {G_{i,j} ∈ R^1 | i, j = 1, ..., T×N}.
Meanwhile, multiple relation graphs are constructed to capture different kinds of related information. In the present invention, a series of graphs {G_g | g = 1, ..., N_g} is built; each graph is computed separately and the graphs do not share weights. Building multiple graphs allows the model to merge and learn different types of relationship information, so that it can perform more reliable relational reasoning.
After graph construction, single-layer relational reasoning is implemented with a GCN. The relation features are calculated as follows:
X' = Σ_{g=1}^{N_g} ReLU(G_g X W_g),
where g represents the number of a relation graph, N_g represents the number of graphs, ReLU(·) represents the ReLU activation function, G_g represents the relation graph numbered g, and W_g represents the weight matrix of the linear transformation corresponding to the relation graph numbered g.
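A rough sketch of this graph branch is given below. It assumes N_g independent graphs, embedded dot-product affinities that are spatially masked and normalized by a softmax (following the ARG-style formulation that serves as the baseline), and a single graph-convolution layer per graph; the layer sizes and names are assumptions for illustration.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphRelationReasoning(nn.Module):
    """Sketch of the graph-based similarity branch: N_g actor relation graphs
    built from embedded dot-product affinities with a distance mask and a
    softmax normalization, followed by one GCN layer per graph."""

    def __init__(self, dim: int, d_k: int = 256, num_graphs: int = 4, mu: float = 0.2):
        super().__init__()
        self.theta = nn.ModuleList([nn.Linear(dim, d_k) for _ in range(num_graphs)])
        self.phi = nn.ModuleList([nn.Linear(dim, d_k) for _ in range(num_graphs)])
        self.gcn_w = nn.ModuleList([nn.Linear(dim, dim, bias=False) for _ in range(num_graphs)])
        self.d_k, self.mu = d_k, mu

    def forward(self, x: torch.Tensor, centers: torch.Tensor) -> torch.Tensor:
        # x: (K, D) appearance features, centers: (K, 2) normalized positions
        mask = torch.cdist(centers, centers) <= self.mu            # spatial position correlation
        x_rel = torch.zeros_like(x)
        for theta, phi, w in zip(self.theta, self.phi, self.gcn_w):
            f_a = theta(x) @ phi(x).t() / math.sqrt(self.d_k)      # embedded dot product
            f_a = f_a.masked_fill(~mask, float("-inf"))            # keep only local edges
            G = F.softmax(f_a, dim=-1)                             # relation graph G_g
            x_rel = x_rel + F.relu(G @ w(x))                       # ReLU(G_g X W_g), summed over graphs
        return x_rel

x = torch.randn(36, 1024)
centers = torch.rand(36, 2)
x_rel = GraphRelationReasoning(1024)(x, centers)
print(x_rel.shape)              # torch.Size([36, 1024])
```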
The difference features X̂ and the relation features X' are obtained through the local residual aggregation network module and the graph-neural-network-based relation reasoning module, respectively, forming the two-branch reasoning network.
3. Weighted fusion module (WAS):
The appearance features X, the difference features X̂ and the relation features X' are input into the adaptive weighted fusion module, and the three different semantic features are weighted and fused in the channel direction. The specific method is as follows:
Firstly, the information of all branch features is integrated by simple element-wise addition to obtain the integrated features, calculated as:
F = Σ_{b=1}^{N_b} X_b,
where F represents the integrated features, F ∈ R^{T×N×D}; b is the branch number; N_b represents the number of branches, which in the above formula is 3; X_b represents the appearance features, the difference features or the relation features, i.e. X_1, X_2 and X_3 denote the different semantic features X, X̂ and X', respectively.
Secondly, global information is embedded in the channel direction by a simple global average pooling layer and a fully connected layer to generate the channel statistics, calculated as:
S = W_s · ( (1/(T×N)) Σ_{t=1}^{T} Σ_{n=1}^{N} F(n, t, :) ),
where S represents the channel statistics, W_s represents the learnable parameters that linearly transform the pooled features, and F(n, t, :) represents the feature of the n-th individual in the t-th frame of F along the channel direction.
Thirdly, the weights of the different branch features in the channel direction are obtained through a simple fully connected layer and a softmax operation, calculated as:
W_b = softmax(w_b S),
where W_b is the weight vector of the features of branch b, and w_b is the learnable linear transformation parameter of branch b that maps the channel statistics S to the weight vector.
Fourthly, each one-dimensional feature in the channel direction of the weighted fused features is finally calculated as:
X̃^c = Σ_{b=1}^{N_b} W_b^c · X_b^c,
where X̃^c represents the value of the c-th channel of the weighted features X̃; W_b^c represents the weight of the b-th branch feature in the c-th channel dimension, i.e. the value of the c-th element of W_b; and X_b^c represents the value of the b-th branch feature in the c-th channel dimension, i.e. the value of the c-th element of X_b.
Through the WAS, the features under different semantics are subjected to self-adaptive weighted fusion, and information more important for group activity recognition is screened out.
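The following sketch illustrates one possible realization of the WAS module described above (element-wise summation, global average pooling, a squeeze fully connected layer, per-branch fully connected layers and a softmax across branches); the reduction ratio and layer shapes are assumptions for illustration, not the exact patented configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedAdaptiveFusion(nn.Module):
    """Sketch of the WAS module: per-channel weights over the three semantic
    branches (appearance, difference, relation). Layer sizes are illustrative."""

    def __init__(self, dim: int, num_branches: int = 3, reduction: int = 16):
        super().__init__()
        self.squeeze = nn.Linear(dim, dim // reduction)     # W_s: produces channel statistics
        self.branch_fc = nn.ModuleList(
            [nn.Linear(dim // reduction, dim) for _ in range(num_branches)])  # w_b per branch

    def forward(self, branches):
        # branches: list of (K, D) tensors for the K = T*N individuals
        fused = torch.stack(branches, dim=0).sum(dim=0)     # F = sum_b X_b (element-wise)
        stats = self.squeeze(fused.mean(dim=0))             # global average pooling + FC -> S
        logits = torch.stack([fc(stats) for fc in self.branch_fc], dim=0)   # (B, D)
        weights = F.softmax(logits, dim=0)                   # per-channel branch weights W_b
        out = sum(w * xb for w, xb in zip(weights, branches))  # weighted fusion per channel
        return out

x, x_hat, x_rel = (torch.randn(36, 1024) for _ in range(3))
fused = WeightedAdaptiveFusion(1024)([x, x_hat, x_rel])
print(fused.shape)              # torch.Size([36, 1024])
```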
4. A pooling operation is performed on the weighted features X̃ obtained above, resulting in a global representation of the entire video clip. The global representation is processed by a simple fully connected layer and then a softmax operation to obtain, for each frame, the confidence of each group activity category. The group activity category is predicted using the average of the per-category confidences over all frames.
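A toy sketch of this prediction step is shown below; frame-wise max-pooling over individuals is an assumed choice of pooling operation, and the number of activity categories is a placeholder.

```python
import torch
import torch.nn as nn

T, N, D, num_classes = 3, 12, 1024, 8                    # hypothetical sizes; 8 classes is a placeholder
classifier = nn.Linear(D, num_classes)

fused = torch.randn(T * N, D)                            # weighted features from the WAS module
frame_repr = fused.reshape(T, N, D).max(dim=1).values    # pool the individuals within each frame
conf = torch.softmax(classifier(frame_repr), dim=-1)     # per-frame confidences over activity classes
pred_class = conf.mean(dim=0).argmax().item()            # average over frames, then predict the category
print(pred_class)
```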
Secondly, the model training method for group activity detection based on the graph neural network and the residual aggregation network is as follows:
1. Acquire video clip samples and the label corresponding to each sample, where the label indicates the group activity of the key frames in the training sample.
2. Divide the samples and their labels into two parts according to a proportion: one part is the training set, used to train the model; the other part is the validation set, used for model selection.
3. Pre-train the backbone network with the processed training set.
4. Process the samples in the training set, output prediction results through the model, and compute the loss between the prediction results and the ground-truth labels using cross-entropy.
5. Train the model through back-propagation and parameter updating, and run inference tests on the validation set so as to select the better model.
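The training procedure above could be sketched as follows; the data loaders, model interface and hyper-parameters are placeholders rather than the exact training setup of the invention.

```python
import torch
import torch.nn as nn

def train(model, train_loader, val_loader, epochs=40, lr=1e-4):
    """Sketch of the described procedure: cross-entropy loss, back-propagation
    with Adam, and validation-based model selection."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, betas=(0.9, 0.999), eps=1e-8)
    criterion = nn.CrossEntropyLoss()
    best_acc, best_state = 0.0, None
    for epoch in range(epochs):
        model.train()
        for frames, boxes, labels in train_loader:
            optimizer.zero_grad()
            logits = model(frames, boxes)          # predicted group-activity scores
            loss = criterion(logits, labels)       # cross-entropy against the ground-truth labels
            loss.backward()                        # back-propagation
            optimizer.step()                       # parameter update
        model.eval()
        correct, total = 0, 0
        with torch.no_grad():
            for frames, boxes, labels in val_loader:   # validation set for model selection
                pred = model(frames, boxes).argmax(dim=-1)
                correct += (pred == labels).sum().item()
                total += labels.numel()
        acc = correct / max(total, 1)
        if acc > best_acc:                         # keep the better-performing model
            best_acc = acc
            best_state = {k: v.detach().clone() for k, v in model.state_dict().items()}
    return best_state, best_acc
```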
Example 2
As shown in fig. 1 to fig. 6, as a further optimization of Embodiment 1, this embodiment includes all the technical features of Embodiment 1 and, in addition, the following technical features:
Firstly, the key frames of a given video clip and the corresponding bounding box of each individual are input into the backbone network, and individual appearance features are obtained using the backbone network and RoIAlign; secondly, the appearance features are input into the local residual aggregation network module and the graph-neural-network-based relation reasoning module to obtain the corresponding difference features and relation features; then, the appearance features, the difference features and the relation features are adaptively weighted and fused by the weighted fusion module to obtain the fused features; finally, a pooling operation is performed on the fused features to obtain the video-level global representation, which is input into the classifier to obtain the final classification result.
In the local residual error aggregation network module, the appearance characteristics of each individual are input into the local residual error aggregation module, residual errors between every two individuals and residual error correlation coefficients corresponding to the residual errors are respectively calculated, and finally the difference characteristics of each individual are calculated under the constraint of spatial positions.
For the group activity recognition task, we evaluated the proposed scheme on two benchmark datasets, the Volleyball dataset and the Collective Activity dataset, and compared it with state-of-the-art methods. Two metrics were used to evaluate model accuracy: MCA (multi-class classification accuracy) and MPCA (mean per-class accuracy).
For the Volleyball dataset, taking ARG as the baseline, the proposed two-branch reasoning scheme formed by the local residual aggregation network module and the graph-neural-network-based relation reasoning module, together with the proposed weighted fusion module, improves MCA by 2.6%. Compared with DIN, an advanced graph-neural-network-based method, the proposed method improves MCA and MPCA by 0.9% and 1.2% respectively with VGG16 as the backbone.
For the Collective Activity dataset, with VGG16 as the backbone and ARG as the baseline, the proposed method improves MCA by 5.1%. Compared with DIN, an advanced graph-neural-network-based method, the proposed method improves MPCA by 0.8% and 0.6% with ResNet18 and VGG16 as backbones, respectively.
In this method, we extract 3 representative frames from each video clip as input; each frame of the Volleyball dataset is cropped to 720 × 1280, and each frame of the Collective Activity dataset is cropped to 480 × 720. ResNet18 or VGG16 is adopted as the backbone network. An adaptive moment estimation (Adam) optimizer is used to train the model, with β1 = 0.9, β2 = 0.999, ε = 10^-8. For the Volleyball dataset, the initial learning rate is set to 1e-4 and is decayed by a factor of 0.3 every 10 epochs, with 40 training epochs in total. For the Collective Activity dataset, the learning rates for ResNet18 and VGG16 are set to 4e-5 and 1e-4 respectively, with 30 training epochs. The spatial limiting factor of the local residual aggregation network module is set to 0.2 and 0.3 of the image width for the Volleyball dataset and the Collective Activity dataset, respectively. The normalization factor d_k of the graph-neural-network-based relation reasoning module is set to 256. The dimension reduction factor of the weighted fusion module is set to 16. The batch size for both datasets is set to 2 during training.
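As an illustration of the optimizer settings quoted above for the Volleyball dataset, one possible PyTorch configuration is sketched below; the StepLR scheduler and the stand-in model are assumptions used only to show the quoted hyper-parameter values.

```python
import torch
import torch.nn as nn

model = nn.Linear(1024, 8)   # stand-in for the full recognition network (placeholder)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4,      # initial learning rate 1e-4
                             betas=(0.9, 0.999), eps=1e-8)     # beta1, beta2, epsilon as quoted
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.3)  # decay 0.3 every 10 epochs

for epoch in range(40):       # 40 training epochs for the Volleyball dataset
    # ... one epoch of training with batch size 2 would run here ...
    scheduler.step()
```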
The accuracy on these two benchmark datasets reflects the advancement of our method. Upon analysis, the method has two main advantages: (1) the proposed local residual aggregation network module can encode the potential differences among all the relevant actors in the group and provide additional cues for reasoning; (2) the proposed weighted fusion strategy can adaptively select the more important information among different semantic features.
As described above, the present invention can be preferably realized.
All features disclosed in all embodiments in this specification, or all methods or process steps implicitly disclosed, may be combined and/or expanded, or substituted, in any way, except for mutually exclusive features and/or steps.
The foregoing is only a preferred embodiment of the present invention, and the present invention is not limited thereto in any way, and any simple modification, equivalent replacement and improvement made to the above embodiment within the spirit and principle of the present invention still fall within the protection scope of the present invention.

Claims (10)

1. A group activity identification method based on a residual aggregation graph network is characterized by comprising the following steps:
S1, extracting appearance features: obtaining the individual-level appearance features X of the group to be identified by utilizing the key frames of a given video clip and the corresponding bounding box of each individual, X = {x_i | i = 1, 2, ..., T×N}, x_i ∈ R^D, wherein T represents the number of key frames of the video clip, N represents the number of individuals in each key frame, i represents the number of an individual in the video clip key frames, x_i represents the appearance feature of the individual numbered i in the video clip, R represents a linear space over the real number field, and D represents the dimension of the appearance features in the linear space R;
S2, two-branch reasoning: performing difference relation reasoning based on residual aggregation on the appearance features to obtain the difference features X̂, and performing similarity relation reasoning based on a graph neural network on the appearance features to obtain the relation features X';
S3, weighted fusion: performing weighted fusion of the appearance features, the difference features and the relation features in the channel direction to obtain the weighted features X̃;
S4, group activity prediction: performing a pooling operation on the weighted features to obtain a global representation of the whole video clip, further processing it to obtain the confidence of each frame for each group activity category, and predicting the group activity category using the average of the per-category confidences over all frames.
2. The group activity identification method based on a residual aggregation graph network according to claim 1, wherein in step S2, the formula for performing difference relation reasoning based on residual aggregation on the appearance features to obtain the difference features is as follows:
x̂_j = Σ_{i=1}^{T×N} r_i(x_j) · m_{j,i} · (x_j − x_i),
where j denotes the number of an individual in the video clip, j = 1, 2, ..., T×N; x̂_j represents the difference feature of the j-th individual and is an element of the difference features X̂; r_i(x_j) represents the residual relationship between individual j and individual i; m_{j,i} represents the spatial position correlation between individual j and individual i; and x_j − x_i represents the difference in appearance features between individual j and individual i (possibly located in different frames).
3. The group activity identification method based on a residual aggregation graph network according to claim 2, wherein the calculation formula of r_i(x_j) is:
r_i(x_j) = σ(w_j^T (x_j − x_i) + b_j),
where σ(·) denotes the sigmoid function, w_j represents the weight that maps the appearance difference between two individuals centered on individual j to a scalar, b_j represents the corresponding scalar offset, w_j ∈ R^D, b_j ∈ R^1.
4. The method according to claim 3, wherein the calculation formula of the spatial position correlation m_{j,i} is:
m_{j,i} = Π(d_{j,i} ≤ μ),
where Π(·) represents the indicator function, d_{j,i} denotes the Euclidean distance between individual j and individual i, and μ denotes the spatial limiting factor.
5. The group activity identification method based on a residual aggregation graph network according to claim 1, wherein in step S2, the formula for performing similarity relation reasoning based on a graph neural network on the appearance features to obtain the relation features is:
X' = Σ_{g=1}^{N_g} ReLU(G_g X W_g),
where g represents the number of a relation graph, N_g represents the number of graphs, ReLU(·) represents the ReLU activation function, G_g represents the relation graph numbered g, and W_g represents the weight matrix of the linear transformation corresponding to the relation graph numbered g.
6. The group activity identification method based on a residual aggregation graph network according to claim 5, wherein the calculation formula of G_g is:
G_g = {G_{i,j} ∈ R^1 | i, j = 1, ..., T×N},
G_{i,j} = Π(d_{i,j} ≤ μ) · exp(f_a(x_i, x_j)) / Σ_{j=1}^{T×N} ( Π(d_{i,j} ≤ μ) · exp(f_a(x_i, x_j)) ),
where G_{i,j} represents the magnitude of the similarity relationship between individual i and individual j, and f_a(x_i, x_j) represents the appearance correlation between individual i and individual j.
7. The method according to claim 6, wherein the calculation formula of f_a(x_i, x_j) is:
f_a(x_i, x_j) = θ(x_i) · Transpose(φ(x_j)) / √d_k,
where θ(x_i) represents the linear transformation that embeds the D-dimensional appearance feature x_i of individual i into a d_k-dimensional space, φ(x_j) represents the linear transformation that embeds the D-dimensional appearance feature x_j of individual j into a d_k-dimensional space, Transpose(·) represents the transpose operation, and d_k denotes the normalization factor.
8. The group activity identification method based on a residual aggregation graph network according to claim 7, wherein step S3 comprises the following steps:
S31, adding the appearance features, the difference features and the relation features element-wise to obtain the integrated features, calculated as:
F = Σ_{b=1}^{N_b} X_b,
where F represents the integrated features, F ∈ R^{T×N×D}; b is the branch number; N_b represents the number of branches, which in the above formula is 3; X_b represents the appearance features, the difference features or the relation features, i.e. X_1, X_2 and X_3 denote the different semantic features X, X̂ and X', respectively;
S32, embedding global information in the channel direction by using global average pooling and a fully connected layer to generate the channel statistics, calculated as:
S = W_s · ( (1/(T×N)) Σ_{t=1}^{T} Σ_{n=1}^{N} F(n, t, :) ),
where S represents the channel statistics, W_s represents the learnable parameters that linearly transform the pooled features, and F(n, t, :) represents the feature of the n-th individual in the t-th frame of F along the channel direction;
S33, obtaining the weights of the different branch features in the channel direction through a fully connected layer and a softmax operation, calculated as:
W_b = softmax(w_b S),
where W_b is the weight vector of the features of branch b, and w_b is the learnable linear transformation parameter of branch b that maps the channel statistics S to the weight vector;
S34, calculating each one-dimensional feature in the channel direction of the weighted fused features as:
X̃^c = Σ_{b=1}^{N_b} W_b^c · X_b^c,
where X̃^c represents the value of the c-th channel of the weighted features X̃; W_b^c represents the weight of the b-th branch feature in the c-th channel dimension, i.e. the value of the c-th element of W_b; and X_b^c represents the value of the b-th branch feature in the c-th channel dimension, i.e. the value of the c-th element of X_b.
9. The method according to claim 8, wherein in step S4, the global representation is passed through a fully connected layer and then a softmax operation, so as to obtain the confidence of each frame for each group activity category.
10. A system applied to the group activity identification method based on a residual aggregation graph network, characterized by comprising an appearance feature extraction module, a two-branch reasoning module, a weighted fusion module and a group activity prediction module which are electrically connected in sequence, wherein the appearance feature extraction module is also electrically connected with the weighted fusion module;
wherein:
appearance feature extraction module: used for obtaining the individual-level appearance features X of the group to be identified by utilizing the key frames of a given video clip and the bounding box of each individual, X = {x_i | i = 1, 2, ..., T×N}, x_i ∈ R^D, wherein T represents the number of key frames of the video clip, N represents the number of individuals in each key frame, i represents the number of an individual in the video clip key frames, x_i represents the appearance feature of the individual numbered i in the video clip, R represents a linear space over the real number field, and D represents the dimension of the appearance features in the linear space R;
two-branch reasoning module: used for performing difference relation reasoning based on residual aggregation on the appearance features to obtain the difference features X̂, and performing similarity relation reasoning based on a graph neural network on the appearance features to obtain the relation features X';
weighted fusion module: used for performing weighted fusion of the appearance features, the difference features and the relation features in the channel direction to obtain the weighted features X̃;
group activity prediction module: used for performing a pooling operation on the weighted features to obtain a global representation of the whole video clip, further processing it to obtain the confidence of each frame for each group activity category, and predicting the group activity category using the average of the per-category confidences over all frames.
CN202210236706.2A 2022-03-10 2022-03-10 Group activity identification method and system based on residual aggregation graph network Active CN114863356B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210236706.2A CN114863356B (en) 2022-03-10 2022-03-10 Group activity identification method and system based on residual aggregation graph network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210236706.2A CN114863356B (en) 2022-03-10 2022-03-10 Group activity identification method and system based on residual aggregation graph network

Publications (2)

Publication Number Publication Date
CN114863356A true CN114863356A (en) 2022-08-05
CN114863356B CN114863356B (en) 2023-02-03

Family

ID=82627853

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210236706.2A Active CN114863356B (en) 2022-03-10 2022-03-10 Group activity identification method and system based on residual aggregation graph network

Country Status (1)

Country Link
CN (1) CN114863356B (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111401174A (en) * 2020-03-07 2020-07-10 北京工业大学 Volleyball group behavior identification method based on multi-mode information fusion
WO2021032295A1 (en) * 2019-08-21 2021-02-25 Toyota Motor Europe System and method for detecting person activity in video
CN112434608A (en) * 2020-11-24 2021-03-02 山东大学 Human behavior identification method and system based on double-current combined network
CN112613349A (en) * 2020-12-04 2021-04-06 北京理工大学 Time sequence action detection method and device based on deep hybrid convolutional neural network
CN112699786A (en) * 2020-12-29 2021-04-23 华南理工大学 Video behavior identification method and system based on space enhancement module
CN112818843A (en) * 2021-01-29 2021-05-18 山东大学 Video behavior identification method and system based on channel attention guide time modeling
CN112926396A (en) * 2021-01-28 2021-06-08 杭州电子科技大学 Action identification method based on double-current convolution attention
US20210248375A1 (en) * 2020-02-06 2021-08-12 Mitsubishi Electric Research Laboratories, Inc. Scene-Aware Video Dialog
CN113435430A (en) * 2021-08-27 2021-09-24 中国科学院自动化研究所 Video behavior identification method, system and equipment based on self-adaptive space-time entanglement
CN113673489A (en) * 2021-10-21 2021-11-19 之江实验室 Video group behavior identification method based on cascade Transformer

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021032295A1 (en) * 2019-08-21 2021-02-25 Toyota Motor Europe System and method for detecting person activity in video
US20210248375A1 (en) * 2020-02-06 2021-08-12 Mitsubishi Electric Research Laboratories, Inc. Scene-Aware Video Dialog
CN111401174A (en) * 2020-03-07 2020-07-10 北京工业大学 Volleyball group behavior identification method based on multi-mode information fusion
CN112434608A (en) * 2020-11-24 2021-03-02 山东大学 Human behavior identification method and system based on double-current combined network
CN112613349A (en) * 2020-12-04 2021-04-06 北京理工大学 Time sequence action detection method and device based on deep hybrid convolutional neural network
CN112699786A (en) * 2020-12-29 2021-04-23 华南理工大学 Video behavior identification method and system based on space enhancement module
CN112926396A (en) * 2021-01-28 2021-06-08 杭州电子科技大学 Action identification method based on double-current convolution attention
CN112818843A (en) * 2021-01-29 2021-05-18 山东大学 Video behavior identification method and system based on channel attention guide time modeling
CN113435430A (en) * 2021-08-27 2021-09-24 中国科学院自动化研究所 Video behavior identification method, system and equipment based on self-adaptive space-time entanglement
CN113673489A (en) * 2021-10-21 2021-11-19 之江实验室 Video group behavior identification method based on cascade Transformer

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
JINHUI TANG et al.: "Coherence Constrained Graph LSTM", IEEE
Zhang Lejun (张乐军): "Research on Group Behavior Recognition Algorithms Based on Surveillance Video", China Master's Theses Full-text Database, Information Science and Technology
Xiong Xin (熊辛): "Research on Behavior Recognition Based on Behavior Feature Optimization and Deep Learning", China Doctoral Dissertations Full-text Database, Information Science and Technology

Also Published As

Publication number Publication date
CN114863356B (en) 2023-02-03

Similar Documents

Publication Publication Date Title
CN110084151B (en) Video abnormal behavior discrimination method based on non-local network deep learning
CN111259786B (en) Pedestrian re-identification method based on synchronous enhancement of appearance and motion information of video
CN112507901B (en) Unsupervised pedestrian re-identification method based on pseudo tag self-correction
CN109299657B (en) Group behavior identification method and device based on semantic attention retention mechanism
CN110827265B (en) Image anomaly detection method based on deep learning
CN113157957A (en) Attribute graph document clustering method based on graph convolution neural network
Chi et al. A decision support system for detecting serial crimes
CN112861970B (en) Fine-grained image classification method based on feature fusion
CN115761900B (en) Internet of things cloud platform for practical training base management
CN112507778B (en) Loop detection method of improved bag-of-words model based on line characteristics
CN113297936A (en) Volleyball group behavior identification method based on local graph convolution network
CN111598032B (en) Group behavior recognition method based on graph neural network
CN114005085A (en) Dense crowd distribution detection and counting method in video
CN115273244A (en) Human body action recognition method and system based on graph neural network
Shi et al. Dynamic barycenter averaging kernel in RBF networks for time series classification
CN112163020A (en) Multi-dimensional time series anomaly detection method and system
CN110111365B (en) Training method and device based on deep learning and target tracking method and device
CN114863356B (en) Group activity identification method and system based on residual aggregation graph network
CN116704609A (en) Online hand hygiene assessment method and system based on time sequence attention
CN116257786A (en) Asynchronous time sequence classification method based on multi-element time sequence diagram structure
CN107665325A (en) Video accident detection method and system based on atomic features bag model
CN117011219A (en) Method, apparatus, device, storage medium and program product for detecting quality of article
CN116012903A (en) Automatic labeling method and system for facial expressions
CN115293249A (en) Power system typical scene probability prediction method based on dynamic time sequence prediction
CN114386494A (en) Product full life cycle quality tracing method and device based on extensible ontology

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant