Disclosure of Invention
The invention provides a group behavior identification method based on key spatiotemporal information driving and group co-occurrence structural analysis, which aims to accurately identify the behavior of each individual in a group and to infer the group behavior by utilizing individual features and the interaction characteristics among individuals.
The invention is realized by adopting the following technical scheme: a group behavior identification method based on key spatiotemporal information driving and group co-occurrence structural analysis comprises the following steps:
Step A: for the video to be identified, track each member in the video to obtain the per-person bounding-box images x_t, input them in time order into the key-person candidate sub-network to extract static and dynamic features, and identify personal behavior attributes to obtain the personal importance weights α_t;
Step B: multiply the personal importance weights α_t obtained in step A with the personal bounding-box images x_t in the main-network CNN, obtaining the spatial features X'_t = x_t * α_t that are input to the stacked LSTM network;
Step C: take the output of step B as input for co-occurrence feature modeling; group the neurons of the stacked LSTM so that different groups learn different co-occurrence features, obtaining the group feature Z_t;
Step D: input the bounding-box images x_t of step A into the key-time-segment candidate sub-network for feature extraction, obtaining the importance weight of the current frame, i.e. the frame importance β_t;
Step E: combine the group feature Z_t obtained in step C with the current-frame importance weight β_t obtained in step D to obtain the current frame's group feature Z'_t, input Z'_t to softmax, and complete the group behavior recognition.
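The data flow of steps B, C and E can be sketched as follows. This is a minimal NumPy illustration, not the patented network: the stacked LSTM of step C is replaced by a hypothetical pooling callback `group_feature_fn`, and all inputs are toy values.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def recognize_group_behavior(x_t, alpha_t, beta_t, group_feature_fn):
    """Steps B-E in miniature: modulate per-person features by spatial
    importance, pool them into a group feature, reweight by temporal
    importance, and classify with softmax."""
    x_mod = x_t * alpha_t[:, None]   # step B: x'_t = x_t * alpha_t
    z_t = group_feature_fn(x_mod)    # step C stand-in for the stacked LSTM
    z_t_prime = z_t * beta_t         # step E: Z'_t = Z_t * beta_t
    return softmax(z_t_prime)        # class probabilities

# toy usage: 3 people, 4-dim features, mean pooling in place of the LSTM
x_t = np.ones((3, 4))
alpha_t = np.array([0.6, 0.3, 0.1])
probs = recognize_group_behavior(x_t, alpha_t, beta_t=0.8,
                                 group_feature_fn=lambda x: x.mean(axis=0))
```

Here the pooled feature is uniform, so the class probabilities come out equal; in the real method the stacked LSTM produces a discriminative Z_t.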
Further, in step A, the personal importance weight α_t is obtained as follows:
(1) First, establish a key-person candidate sub-network, which comprises, connected in series, a CNN layer, an LSTM layer, a fully connected layer, a tanh activation function and a second fully connected layer;
(2) Secondly, obtaining behavior attribute scores of group members, specifically:
At time t, suppose there are M members in the scene; the bounding-box feature set extracted by the CNN network is x_t = (x_t,1, ..., x_t,M)^T, and the behavior attribute score is s_t = (s_t,1, ..., s_t,M)^T, which represents the behavior-class judgment of the M members, expressed as:

s_t = U_s · tanh(W_xs · x_t + W_hs · h_(t-1) + b_s) + b_us

where T is the length of the video sequence, U_s, W_xs, W_hs are learnable parameter matrices, b_s and b_us are bias vectors, and h_(t-1) denotes the hidden variable from the LSTM layer.
(3) Finally, obtain the importance weight of each member and thereby determine the key persons; the specific steps are as follows:
Given the group behavior class set G_action = (A_1, A_2, ..., A_q)^T, for the k-th person compute the relevance of his behavior attribute score s_t,k to G_action, measured by the cosine similarity of the two multidimensional vectors:

c_t,k = (s_t,k · G_action) / (||s_t,k||_2 · ||G_action||_2)

where ||·||_2 denotes the 2-norm; this coefficient, converted to a normalized cosine-angle coefficient, gives the importance weight of each person in the spatial scene, computed by normalizing over the M members:

α_t,k = exp(c_t,k) / Σ_{j=1}^{M} exp(c_t,j)

α_t,k determines how much each member contributes to the group behavior recognition task.
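The cosine-similarity weighting can be sketched as follows. The softmax normalization over members is an assumption of this sketch (the text specifies only a normalized coefficient), and the score vectors are toy values.

```python
import numpy as np

def importance_weights(s_t, g_action):
    """Cosine similarity between each member's behavior-attribute score
    vector and the group-behavior class vector, normalized over members
    (softmax normalization assumed)."""
    cos = s_t @ g_action / (np.linalg.norm(s_t, axis=1)
                            * np.linalg.norm(g_action) + 1e-8)
    e = np.exp(cos)
    return e / e.sum()   # alpha_t: one weight per member, sums to 1

# toy usage: 3 members scored over 2 behavior classes
s_t = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
g = np.array([1.0, 1.0])
alpha = importance_weights(s_t, g)
```

Member 2's score points in the same direction as the class vector, so it receives the largest weight.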
Further, step B is specifically implemented as follows:
(1) First, the spatial feature of the k-th person at time t is modulated by its importance weight:
x'_t,k = α_t,k · x_t,k
(2) Then all importance-weighted individual spatial features are aggregated as the input to the stacked LSTM in the main network:
X'_t = (x'_t,1, ..., x'_t,K).
Further, in step C, co-occurrence feature learning is performed as follows:
Step C1: first, establish an end-to-end fully connected deep LSTM network model to realize automatic learning of temporal features and motion modeling; LSTM layers and feedforward layers are arranged alternately to form a deep network that captures motion information, each feedforward layer lying between two LSTM layers, so that every neuron in one layer is fully connected to the neurons of the next layer;
Step C2: group the neurons of the stacked LSTM and, in the objective function, introduce a constraint on the connection weights between member individuals and neurons, so that neurons in the same group have larger weight connections to a subset of member individuals and smaller weight connections to other nodes, thereby mining the co-occurrence of member individuals.
Further, in step C1, to ensure that the fully connected deep LSTM network model learns effective features, different types of regularization are applied to different parts of the model, specifically two types:
1) For the fully connected layers, regularization is introduced to drive the model to learn the co-occurrence features of individuals at different layers, as well as co-occurrence feature learning of nodes between LSTM layers;
2) For LSTM neurons, a new Dropout layer is derived and applied to the LSTM neurons in the last LSTM layer.
Further, in step C2, the mining and exploitation of co-occurrence is achieved by adding a group-sparse constraint to the connections between each group of neurons and the member individuals:
(1) The neurons of each LSTM layer are grouped according to the number of group behavior categories; for the K-th group, the neurons of the group are trained to automatically distinguish different individual behaviors, and a co-occurrence regularization is added to the loss function:

L_total = L + λ_1 ||W_xβ||_1 + λ_2 Σ_s Σ_{k=1}^{θ} ||W^s_xβ,k||_2   (1)

where L is the maximum-likelihood loss function of the deep LSTM network, W_xβ = [W_x,1; ...; W_x,K] is the weight matrix connecting the input units to β; letting N denote the number of neurons, the N neurons are divided into θ groups, with ε = ⌊N/θ⌋ neurons per group; for LSTM layers, s = {i, f, o, c} denotes the input gate, forget gate, output gate and cell of the LSTM neurons, and for feedforward layers, s = {h} denotes the neurons themselves;
The second term in equation (1) is an L1 regularization used to determine a relatively important subset of key persons during training; the third term, an L2 norm taken over each group of units W_xβ,k, encourages the matrix W_xβ to become group-wise sparse;
Driven by this, features with different descriptive power are selected as input, different neuron groups explore different co-occurrence features, and the model is then solved by gradient descent.
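The regularization in equation (1) can be sketched as follows, assuming for illustration that the rows of W are ordered so that each consecutive block of ε rows belongs to one neuron group; the λ values and the matrix are toy inputs.

```python
import numpy as np

def co_occurrence_penalty(W, n_groups, lam1=1e-3, lam2=1e-3):
    """L1 term over the whole input-connection matrix plus a group-wise
    L2 norm (an l2,1-style group-sparse term) over each neuron group's
    block of rows, in the spirit of equation (1)."""
    eps = W.shape[0] // n_groups   # neurons per group
    l1 = np.abs(W).sum()
    group_l2 = sum(np.linalg.norm(W[k * eps:(k + 1) * eps])
                   for k in range(n_groups))
    return lam1 * l1 + lam2 * group_l2

W = np.ones((4, 2))   # toy connection matrix: 4 neurons, 2 individuals
penalty = co_occurrence_penalty(W, n_groups=2)
```

During training this penalty is added to the network loss, so gradient descent drives whole groups of connections toward zero while keeping the groups that support a behavior class.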
Further, step D is specifically implemented as follows:
(1) First, establish a key-time-segment candidate sub-network, which comprises, connected in series, a CNN layer, an LSTM layer, a fully connected layer and a ReLU nonlinear unit;
(2) For the video sequence input into the sub-network, obtain the behavior attribute score of the current frame with the ReLU unit, namely: o_t = ReLU(w_x' x_t + w_h' h'_(t-1) + b') = (o_1, o_2, ..., o_C)_t, where C is the total number of behavior categories and t denotes the current frame; its value depends on the current input x_t and the hidden state h'_(t-1) of the LSTM layer at time t-1;
(3) Finally, according to the degree of association between the current-frame behavior attribute score and the group behavior attributes, obtain the importance weight β_t of the current frame in the input sequence T, as follows:
Given the group behavior class set G_action = (A_1, A_2, ..., A_q)^T, the temporal importance weight β_t is obtained by computing the joint similarity coefficient between this set and the current-frame behavior attribute score o_t = (o_1, o_2, ..., o_C)_t:

β_t = n(G_action ∩ o_t) / n(G_action ∪ o_t)

where ∩ denotes the intersection of the two sets and n(·) denotes the number of elements in a set.
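Treating the frame's scored behavior classes and the group class set as plain sets, the joint similarity coefficient reads as a Jaccard-style ratio. The thresholding that turns the score vector o_t into a set is an assumption of this sketch.

```python
def temporal_importance(frame_scores, group_classes, thresh=0.5):
    """Jaccard-style joint similarity between the set of behavior classes
    the current frame scores above `thresh` and the group class set."""
    frame_set = {c for c, s in frame_scores.items() if s > thresh}
    group_set = set(group_classes)
    if not frame_set | group_set:
        return 0.0
    return len(frame_set & group_set) / len(frame_set | group_set)

# toy usage: hypothetical class names and scores
beta = temporal_importance({"spike": 0.9, "jump": 0.6, "stand": 0.1},
                           ["spike", "block"])
```

A frame whose detected behaviors overlap heavily with the group behavior set receives a β_t close to 1; unrelated frames receive a weight near 0.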
Further, in step E, based on the group feature Z_t and the current-frame importance weight β_t, group behavior identification is carried out as follows:
(1) First, the group feature of the frame at time t is computed as:
Z'_t = Z_t · β_t
(2) It is then input into the softmax layer for the final group behavior recognition:
y = softmax(Z'_t)
where y is the group behavior category.
Further, in view of the complexity of the model, the main network, the key-person candidate sub-network and the key-time-segment candidate sub-network are jointly trained; the joint training process is as follows:
Input: training iteration counts N1 and N2 of the model;
(1) Initialize the network parameters with a Gaussian function;
(2) Fix the weights of the key-person candidate sub-network, and jointly train the key-time-segment candidate sub-network with a main network containing only one LSTM layer, obtaining a key-time-segment candidate model;
(3) Repeat iteratively, increasing the main network to three LSTM layers and training it for N1 iterations;
(4) Fine-tune the main network and the key-time-segment candidate sub-network for N2 iterations;
(5) Fix the key-time-segment candidate sub-network, and jointly train the key-person candidate sub-network with a main network containing only one LSTM layer, obtaining the key-person candidate sub-network;
(6) Repeat iteratively, increasing the main network to three LSTM layers and training it for N1 iterations;
(7) Fine-tune the main network and the key-person candidate sub-network for N2 iterations;
(8) Jointly train the sub-networks obtained in (4) and (7) for N1 iterations;
(9) Fine-tune the whole network model jointly for N2 iterations;
Output: the whole group behavior recognition model obtained at final convergence.
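The staged schedule above can be sketched as a driver loop; `train_step` is a hypothetical callback standing in for one optimization pass over the sub-networks named by `phase`.

```python
def joint_training_schedule(train_step, n1, n2):
    """Run the staged schedule: N1 iterations for each training stage and
    N2 iterations for each fine-tuning stage (stage names are labels for
    the sub-networks being trained, not real module identifiers)."""
    stages = [
        ("time_subnet+1_lstm_layer", n1),    # step (2)
        ("time_subnet+3_lstm_layers", n1),   # step (3)
        ("finetune_time_subnet", n2),        # step (4)
        ("person_subnet+1_lstm_layer", n1),  # step (5)
        ("person_subnet+3_lstm_layers", n1), # step (6)
        ("finetune_person_subnet", n2),      # step (7)
        ("joint_subnets", n1),               # step (8)
        ("finetune_all", n2),                # step (9)
    ]
    for phase, iters in stages:
        for i in range(iters):
            train_step(phase, i)

# toy usage: record which phase each call belongs to
calls = []
joint_training_schedule(lambda phase, i: calls.append(phase), n1=3, n2=2)
```

With n1=3 and n2=2 the schedule issues 5·n1 + 3·n2 = 21 training steps, growing the main network's depth before each fine-tuning pass.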
Compared with the prior art, the invention has the following advantages and positive effects:
The group behavior recognition method based on spatiotemporal importance and co-occurrence uses an importance mechanism to focus on the important individual behaviors within a group behavior, extracting the more important personal features and the important scene frames, and jointly processes the interaction information within the group behavior. Through the combination of spatiotemporal importance and co-occurrence, personal features are better utilized and the key information among many signals is effectively exploited, improving both accuracy and efficiency; the method therefore has important scientific practicality and great economic value.
Detailed Description
In order that the above objects, features and advantages of the invention will be more readily understood, a further description of the invention will be rendered by reference to specific embodiments thereof which are illustrated in the appended drawings. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, however, the present invention may be practiced otherwise than as described herein, and therefore the present invention is not limited to the specific embodiments disclosed below.
In order to accurately identify the behavior of each individual in the group and infer the group behavior by using individual features and the interactions between individuals, this embodiment provides a group behavior identification method based on key spatiotemporal information driving and group co-occurrence structural analysis, which comprises the following steps:
Step A: for the video to be identified, track each member in the video to obtain the per-person bounding-box images x_t, input them in time order into the key-person candidate sub-network to extract static and dynamic features, and identify personal behavior attributes to obtain the personal importance weights α_t;
Step B: multiply the personal importance weights α_t obtained in step A with the personal bounding-box images x_t in the main network, obtaining the spatial features X'_t = x_t * α_t that are input to the main network's stacked LSTM;
Step C: take the output of step B as input for co-occurrence feature modeling; group the neurons of the stacked LSTM so that different groups learn different co-occurrence features, obtaining the group feature Z_t;
Step D: input the bounding-box images x_t of step A into the key-time-segment candidate sub-network for feature extraction, obtaining the importance weight of the current frame, i.e. the frame importance β_t;
Step E: combine the group feature Z_t obtained in step C with the current-frame importance weight β_t obtained in step D to obtain the current frame's group feature Z'_t, input it into softmax for group behavior identification, and complete the group behavior recognition.
In order to achieve both accuracy and efficiency in group behavior identification, the scheme designs a main network and two sub-networks (a key-person candidate sub-network and a key-time-segment candidate sub-network). The main network realizes feature extraction, exploitation of spatiotemporal correlation and the final classification; the key-person candidate sub-network assigns appropriate importance to different individuals; the key-time-segment candidate sub-network assigns appropriate importance to different frames. The identification of individual behavior is not performed within the main network; instead, individual behavior is inferred directly within the sub-networks to obtain the key-person ranking, which then controls the flow of useful information into the main network.
1. In a typical group behavior recognition scenario, although group activities are performed jointly by many people, the members that determine the group behavior within a given time period are usually only a few (the key persons). Therefore, two sub-networks are designed to attend to the key-person information, shielding the interference of useless information from other members and optimizing the recognition precision of the model. The two sub-networks are the key-person candidate sub-network and the key-time-segment candidate sub-network: the key-person candidate sub-network quantitatively ranks the importance of the group members' spatial positions according to the relevance of personal behavior to group behavior, and controls the input of CNN spatial information to the main network; the key-time-segment candidate sub-network quantitatively selects the important temporal information output by the stacked LSTM in the main network according to the relevance of member behavior categories to group behavior, attending to the useful time segments and refining the information input to the softmax classification layer. Together, the two sub-networks distill the key-moment information in the group behavior; the purpose is to shield the influence of interference noise from irrelevant persons in the group.
2. In addition, in a complex group activity, a given group behavior is typically coordinated by several specific individuals within the group; group members occur in structured subgroups, with close interactions among the members of each small set. This scheme focuses on the core subgroup formed by the key persons and reduces the influence of irrelevant members on group behavior identification; the property that the core subgroup members interact cooperatively and jointly determine the group behavior is called co-occurrence. There are many kinds of group behaviors, so there will be a corresponding number of such relatively stable "co-occurrence subgroups". It should be emphasized that these co-occurrence subgroups are time-varying, and within a given time segment only one "co-occurrence subgroup" dominates.
To characterize such "co-occurrence subgroups" within a group, three stacked bi-directional LSTM layers are designed in the main network and the LSTM neurons of each layer are grouped, with each group attending to only one class of group behavior; every neuron within a group is connected to every member of the group (for example, if the neurons within one LSTM layer are divided into 6 groups, 6 different group behaviors can be attended to, and the behavior attribute each of the six groups attends to is fixed). Neurons in the same group are therefore trained to have larger weight connections to the subset of individuals exhibiting a certain specific behavior and smaller weight connections to other, less correlated individuals. Co-occurrence features are learned in this way, highlighting the co-occurring temporal information of the key core subgroup; once the key-time-segment candidate sub-network features are learned, the redundant useless information in the LSTM can be further suppressed, improving the signal-to-noise ratio of the co-occurring temporal information, which serves as the input to the softmax classification layer and thus improves the recognition accuracy of group behaviors.
The following describes the group behavior recognition method in detail in this embodiment, specifically:
In step A, the personal importance weight α_t is obtained as follows:
(1) First, establish a key-person candidate sub-network which, as shown in figure 1, comprises, connected in series, a CNN layer, an LSTM layer, a fully connected layer, a tanh activation function and a second fully connected layer;
(2) Secondly, obtaining behavior attribute scores of group members, specifically:
At time t, suppose there are M members in the scene; the bounding-box feature set extracted through the CNN network is x_t = (x_t,1, ..., x_t,M)^T, and the behavior attribute score s_t = (s_t,1, ..., s_t,M)^T, representing the behavior-class judgment of the M members, is obtained by the following formula:

s_t = U_s · tanh(W_xs · x_t + W_hs · h_(t-1) + b_s) + b_us

where T is the length of the video sequence, U_s, W_xs, W_hs are learnable parameter matrices, b_s and b_us are bias vectors, and h_(t-1) denotes the hidden variable from the LSTM layer;
(3) Finally, obtain the importance weight of each member and thereby determine the key persons; the specific steps are as follows:
Given the group behavior class set G_action = (A_1, A_2, ..., A_q)^T, for the k-th person compute the relevance of his behavior attribute score s_t,k to G_action, measured by the cosine similarity of the two multidimensional vectors:

c_t,k = (s_t,k · G_action) / (||s_t,k||_2 · ||G_action||_2)

where ||·||_2 denotes the 2-norm; this coefficient, converted to a normalized cosine-angle coefficient, gives the importance weight of each person in the spatial scene, computed by normalizing over the M members:

α_t,k = exp(c_t,k) / Σ_{j=1}^{M} exp(c_t,j)

α_t,k determines how much each member contributes to the group behavior recognition task, and thus controls the amount of information that flows from that member to the main network.
In step B, the personal importance weights α_t obtained in step A are multiplied with the personal bounding-box features x_t input to the main-network CNN, in order to obtain the features X'_t = x_t * α_t input to the stacked LSTM network. This is realized as follows:
(1) First, the spatial feature of the k-th person at time t is modulated by its importance weight:
x'_t,k = α_t,k · x_t,k
(2) All importance-weighted individual spatial features are then aggregated as the input to the stacked LSTM:
X'_t = (x'_t,1, ..., x'_t,K)
The key-person candidate sub-network proposed in this embodiment determines the importance of individual behavior from all individuals at the current time together with the hidden variables of the LSTM layer. It aims to assign importance weights to individuals in group activities, and since the LSTM hidden variable h_(t-1) contains information from past frames, it can explore long-term dynamics.
In step C, the output of step B is taken as input for co-occurrence feature modeling: the neurons of the stacked LSTM are grouped and different groups learn different co-occurrence features, yielding the group feature Z_t. This is realized as follows:
(1) First, an end-to-end fully connected deep stacked-LSTM network model is established to realize temporal feature learning and motion modeling. In order to reliably model the complex relations among different individuals, LSTM layers and feedforward layers are deployed alternately to form a deep network for capturing motion information, each feedforward layer lying between two LSTM layers, as shown in figure 2; the effect is that every neuron in one layer is fully connected to the neurons of the next layer, with no same-layer connections between neurons and no cross-layer connections.
To ensure that the model learns effective features, different types of regularization are applied to different parts of the model to alleviate overfitting (the model is a deep network composed of fully connected LSTM layers and feedforward layers; its structure is relatively complex and prone to overfitting, so the regularization is designed to reduce the model's effective complexity). Specifically, this embodiment provides two types of regularization:
1) For the fully connected layers, this embodiment introduces regularization to drive the model to learn the co-occurrence features of individuals at different layers, as well as co-occurrence feature learning of nodes between LSTM layers;
2) For LSTM neurons, a new Dropout layer is derived and applied to the LSTM neurons in the last LSTM layer, which helps the network learn complex dynamics.
(2) The neurons in the stacked LSTM are grouped, and a constraint on the weights between member individuals and neurons is introduced into the objective function, so that neurons in the same group have larger weight connections to a subset of member individuals and smaller weight connections to other nodes, thereby mining the co-occurrence of member individuals.
As shown in fig. 3, the LSTM layer of the main network is composed of LSTM neurons divided into K groups; each neuron in the same group has larger connection weights to some individuals (i.e., the subset of members closely related to a certain class of group behavior) and smaller connection weights to the other individuals. Different groups of neurons are sensitive to different group behaviors, in that the individual subsets corresponding to the larger connection weights differ between neuron groups. In practice, this mining and exploitation of co-occurrence can be achieved by adding a group-sparse constraint to the connections between each group of neurons and the member individuals.
1) The neurons of each LSTM layer are grouped according to the number of group behavior categories; for example, with 10 behavior categories the neurons are divided into 10 groups. Each neuron is fully connected to every individual, the individuals differ in importance, and the trained neurons learn which individuals are important, thereby highlighting the important subgroup.
Accordingly, this embodiment designs a fully connected main network that allows every neuron to connect to any individual, so as to automatically discover the co-occurrence features inside the group; the neurons in the same layer are divided into θ groups, and different groups are made to focus on discriminating different behavior classes. Taking the K-th group of neurons as an example, the neurons of the group are trained to automatically distinguish different individual behaviors, and the co-occurrence regularization is added to the loss function. (The regularization serves two purposes here: one is to prevent overfitting, the other is to incorporate prior information so that the model learns the desired structure.) The design is as follows:
L_total = L + λ_1 ||W_xβ||_1 + λ_2 Σ_s Σ_{k=1}^{θ} ||W^s_xβ,k||_2   (1)

where L is the maximum-likelihood loss function of the deep LSTM network, W_xβ = [W_x,1; ...; W_x,K] is the weight matrix connecting the input units to β; letting N denote the number of neurons, the N neurons are divided into θ groups, with ε = ⌊N/θ⌋ neurons per group; for LSTM layers, s = {i, f, o, c} denotes the input gate, forget gate, output gate and cell of the LSTM neurons, and for feedforward layers, s = {h} denotes the neurons themselves;
The second term in equation (1) is an L1 regularization: during training it shrinks small weights quickly and large weights slowly, so the final model's weights concentrate on the features of high importance while the weights of less important features quickly approach 0; a relatively important subset of key persons can therefore be determined from it. The third term, an L2 norm taken over each group of units W_xβ,k, encourages the matrix W_xβ to become group-wise sparse.
Driven by this, features with different descriptive power are selected as input and different neuron groups explore different co-occurrence features, giving the model the ability to identify multiple action categories; the model is then solved by gradient descent. (The optimization of the matrix W_xβ,k, whose values become sparse, in effect realizes the feature-importance selection for the θ co-occurring subgroups.)
During training, the network randomly drops some neurons to force the remaining network units to compensate; during testing, the network uses all neurons together to make predictions. Extending this idea to LSTM networks, the present approach uses a new gradient-descent algorithm that lets the internal gates, cells and output responses of LSTM neurons perform gradient descent selectively, encouraging each unit to learn better parameters. As shown in fig. 4, which unrolls the LSTM neurons, it is undesirable to erase all information in a cell, because the cell memorizes events that occurred over time; the dropout effect in the LSTM is therefore allowed to flow along the layers (the dashed lines in fig. 4) but prohibited from flowing along the time axis.
The cell response transmitted in the time direction, without dropout, is:
c_t = f_t ⊙ c_(t-1) + i_t ⊙ g_t
The dropped response, transmitted along the layer direction, is:
h_t = m_h ⊙ ((m_o ⊙ o_t) ⊙ tanh(m_c ⊙ c_t))
where m_i, m_f, m_c, m_o and m_h are the binary dropout mask vectors of the input gate, forget gate, cell memory, output gate and output response respectively (m_i and m_f act analogously on the gate activations), an element value of 0 indicating that a drop occurred, and g_t is the candidate cell input. For the first LSTM layer, the input x_t is the obtained single-person behavior feature; for the higher LSTM layers, the input x_t is the response output of the previous layer.
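The layer-direction masking described above can be sketched as follows. This is an illustrative NumPy sketch under the reconstruction just given, not the patent's exact equations: the time-axis cell state is updated unmasked, while only the response handed up to the next layer is masked.

```python
import numpy as np

def lstm_dropout_response(c_prev, gates, masks):
    """Layer-direction dropout for one LSTM cell: the cell state carried
    along the time axis is updated unmasked, while the response handed up
    to the next layer is masked (mask value 0 = dropped)."""
    i, f, o, g = gates["i"], gates["f"], gates["o"], gates["g"]
    m_c, m_o, m_h = masks["c"], masks["o"], masks["h"]
    c_t = f * c_prev + i * g                      # time path: never erased
    h_t = m_h * ((m_o * o) * np.tanh(m_c * c_t))  # layer path: maskable
    return c_t, h_t

# toy usage: dropping the output response zeroes h but leaves c intact
gates = {k: np.array([0.5]) for k in "ifog"}
keep = {k: np.array([1.0]) for k in "coh"}
drop_h = dict(keep, h=np.array([0.0]))
c1, h1 = lstm_dropout_response(np.array([1.0]), gates, keep)
c2, h2 = lstm_dropout_response(np.array([1.0]), gates, drop_h)
```

Because the mask never touches the f_t ⊙ c_(t-1) term, dropping a unit's response does not erase what the cell has memorized over time.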
In step D, the bounding-box information of step A is input into the key-time-segment candidate sub-network for feature extraction, obtaining the importance weight of the current frame, i.e. the frame importance β_t. This is realized as follows:
(1) First, establish a key-time-segment candidate sub-network which, as shown in figure 1, comprises, connected in series, a CNN layer, an LSTM layer, a fully connected layer and a ReLU nonlinear unit;
(2) Then, for the video sequence input into the sub-network, obtain the behavior attribute score of the current frame with the ReLU unit, namely: o_t = ReLU(w_x' x_t + w_h' h'_(t-1) + b') = (o_1, o_2, ..., o_C)_t, where C is the total number of behavior categories and t denotes the current frame; its value depends on the current input x_t and the hidden state h'_(t-1) of the LSTM layer at time t-1.
For a sequence of video frames, the amount of valuable information provided by different frames is often unequal: only some frames contain the most distinctive information, while the other frames provide context as a supplement. For example, for the group action of a spike in a volleyball match, frames of preparatory actions such as the run-up and the jump are less important than the frame of the spike itself.
(3) Finally, according to the degree of association between the current-frame behavior attribute score and the group behavior attributes, the importance weight β_t of the current frame in the input sequence T is obtained as follows:
Given the group behavior class set G_action = (A_1, A_2, ..., A_q)^T, the temporal importance weight β_t is obtained by computing the joint similarity coefficient between this set and the current-frame behavior attribute score o_t = (o_1, o_2, ..., o_C)_t:

β_t = n(G_action ∩ o_t) / n(G_action ∪ o_t)

where ∩ denotes the intersection of the two sets and n(·) denotes the number of elements in a set.
In step E, the group feature Z_t obtained in step C is combined with the current-frame importance weight β_t obtained in step D to obtain the current frame's group feature Z'_t, which is input into the softmax layer for the final group behavior identification, specifically:
(1) First, the group feature of the frame at time t is computed as:
Z'_t = Z_t · β_t
(2) It is then input into the softmax layer for the final group behavior recognition:
y = softmax(Z'_t)
where y is the group behavior category.
The invention designs sub-networks so that the network pays different levels of attention to different individuals and assigns different importance to different frames. As shown in the network model of figure 1, the key-person candidate sub-network acts mainly on the input of the main LSTM, and the key-time-segment candidate sub-network acts mainly on its output.
The scheme integrates the key-person candidate sub-network and the key-time-segment candidate sub-network in the same network: the former acts on the input of the main LSTM network and the latter on its output. The final objective function of the two sub-networks is formulated as a sequence cross-entropy loss with regularization, as follows:
L_obj = -Σ_{i=1}^{C} y_i · log ŷ_i + λ1 Σ_{k=1}^{K} (1 - Σ_{t=1}^{T} α_t,k)^2 + λ2 Σ_{t=1}^{T} ||β_t||_2^2 + λ3 ||W_uv||_1

where y = (y_1, ..., y_C) denotes the true label: if the sequence belongs to class i, then y_i = 1 and y_j = 0 for j ≠ i; ŷ_i denotes the probability scalar of predicting the sequence as class i; and λ1, λ2 and λ3 balance the contributions of the three regularization terms.
The first regularization term encourages the key-person candidate sub-network to dynamically attend to more spatial nodes in the sequence. This embodiment found that, over time, the network model tends to persistently ignore many individuals even when they are valuable for determining the action type, i.e. it falls into a local optimum; this regularization term is introduced to avoid such a degenerate solution. The second regularization term controls the magnitude of the learned key-time-segment weights with an L2 norm rather than letting them grow without limit; this mitigates the vanishing of the gradient in back-propagation, which is proportional to 1/β_t. The third regularization term, an L1 norm, reduces overfitting of the network, where W_uv denotes a connection matrix in the network.
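The objective with its three regularizers can be sketched as below. The exact analytic form of the first (attention-spread) term is an assumption of this sketch, chosen to match its stated purpose of keeping the spatial attention from collapsing onto a few individuals; all inputs are toy values.

```python
import numpy as np

def total_loss(y_true, y_prob, alphas, betas, W_uv,
               lam1=0.1, lam2=0.1, lam3=0.1):
    """Cross entropy plus: (1) a spread term pushing each individual's
    attention to sum towards 1 over time (assumed form), (2) an L2 term
    on the temporal weights, (3) an L1 term on the connection matrix."""
    ce = -np.sum(y_true * np.log(y_prob + 1e-12))
    r1 = np.sum((1.0 - alphas.sum(axis=0)) ** 2)  # alphas shaped T x K
    r2 = np.sum(betas ** 2)
    r3 = np.abs(W_uv).sum()
    return ce + lam1 * r1 + lam2 * r2 + lam3 * r3

loss = total_loss(y_true=np.array([1.0, 0.0]),
                  y_prob=np.array([1.0, 0.0]),
                  alphas=np.ones((2, 1)),     # T=2 frames, K=1 person
                  betas=np.array([2.0]),
                  W_uv=np.array([[1.0, -1.0]]),
                  lam1=1.0, lam2=1.0, lam3=1.0)
```

With a perfect prediction the cross entropy vanishes and the loss reduces to the three penalty terms, which is what the balancing coefficients λ1, λ2, λ3 trade off during training.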
Finally, considering the complexity of the model: the three networks interact and the objective contains two loss functions, so direct optimization is quite difficult. The scheme therefore provides a strategy of joint training of the main network and the sub-networks that can effectively train the whole model and achieve a better result. The training process is as follows:
input: training times N1, N2 of model (e.g., n1=1000, n2=500)
1. Initializing network parameters using a gaussian function;
2. fixing the weights of the candidate sub-networks of the key characters (according to the initialized weights), and jointly training the candidate main network of the sub-network of the key time period with only one LSTM layer of the main network to obtain a candidate model of the key time period;
3. repeating the iteration, and training the main network after increasing the LSTM layer to three layers through N1 iterations;
4. fine-tuning the main network and the candidate sub-network of the key time period through N2 iterations;
5. fixing the candidate sub-network of the key time period, and jointly training the candidate sub-network of the key person and the main network which only have one LSTM layer to obtain the candidate sub-network of the key person;
6. repeating the iteration, and training the main network after increasing the LSTM layer to three layers through N1 iterations;
7. fine-tuning the main network and the candidate sub-network of the key time period through N2 iterations;
8. obtaining a sub-network through N1 iterative joint training steps 4 and 7;
9. fine tuning the whole network model together through N2 iterations;
and (3) outputting: and finally converging to obtain the whole group behavior recognition model.
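The staged schedule above can be sketched as a driver loop. The helper `staged_training` and its `train_step(phase, trainable, iterations)` callback are hypothetical names introduced only for illustration; the sketch encodes nothing but the order of the phases, which parameter groups are unfrozen in each ("alpha" for the key-person sub-network, "beta" for the key time-segment sub-network, "main" for the main LSTM), and the iteration budgets N1 and N2.

```python
def staged_training(train_step, N1=1000, N2=500):
    """Hypothetical driver mirroring training steps 2-9 of the scheme."""
    # steps 2-4: key-person weights fixed; train time-segment sub-network + main LSTM
    train_step("time-seg + 1-layer main", ["beta", "main"], N1)
    train_step("time-seg + 3-layer main", ["beta", "main"], N1)
    train_step("fine-tune main + time-seg", ["beta", "main"], N2)
    # steps 5-7: time-segment sub-network fixed; train key-person sub-network + main LSTM
    train_step("person + 1-layer main", ["alpha", "main"], N1)
    train_step("person + 3-layer main", ["alpha", "main"], N1)
    train_step("fine-tune main + person", ["alpha", "main"], N2)
    # steps 8-9: joint training of both sub-networks, then whole-model fine-tuning
    train_step("joint sub-networks", ["alpha", "beta", "main"], N1)
    train_step("fine-tune all", ["alpha", "beta", "main"], N2)
```

Each phase would freeze the parameter groups not listed before calling the underlying optimizer; the alternation avoids optimizing the two attention sub-networks against each other from a cold start.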
In the group behavior recognition method provided by the scheme, the key-person candidate sub-network assigns a different importance weight to each person in the group, so that no information is lost and the individuals that contribute most to group behavior recognition receive the most attention; likewise, the key time-segment candidate sub-network assigns a different importance weight to each frame, so that no frame is discarded and no data are lost. Combined with continuous iterative training of the model, this greatly improves the efficiency and accuracy of group behavior recognition.
In addition, existing expressions of interaction relations at the current stage model the interactions of the persons in a group with a graph model, which requires huge amounts of data and makes model training difficult. In this scheme, by contrast, the interaction relations inside group behaviors are handled through their co-occurrence property: a fully connected stacked bidirectional LSTM is adopted, its neurons are grouped, and different groups learn different co-occurrence features, which further effectively improves the accuracy of group behavior recognition.
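One common way to make groups of recurrent neurons specialize to different co-occurrence patterns is a group-sparsity (L2,1-style) penalty on the input-to-hidden weights, so that each neuron group is pushed toward its own subset of input features. The sketch below is an assumption about how the grouping could be realized, not necessarily the invention's exact mechanism; the function name and the group split along the hidden axis are illustrative.

```python
import numpy as np

def group_sparsity_penalty(W_ih, n_groups):
    """W_ih: (H, D) input-to-hidden weight matrix of one stacked-LSTM layer.
    Splits the H hidden neurons into n_groups and sums, over groups, the
    per-input-dimension L2 norms (an L2,1 norm), encouraging each group
    to respond to a different subset of the D input features."""
    groups = np.array_split(W_ih, n_groups, axis=0)
    # small epsilon keeps the square root differentiable at zero
    return sum(np.sqrt((g ** 2).sum(axis=0) + 1e-12).sum() for g in groups)
```

Added to the training loss, such a penalty drives whole columns of each group's weight block toward zero, which is what lets different groups capture different co-occurrence features.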
The present invention is not limited to the above-mentioned embodiments. Any equivalent embodiment obtained by changing or modifying the technical content disclosed above may be applied to other fields; however, any simple modification, equivalent change, or refinement made to the above embodiments according to the technical substance of the present invention, without departing from its technical content, still falls within the protection scope of the technical solution of the present invention.