CN111401174A - Volleyball group behavior identification method based on multi-mode information fusion - Google Patents

Volleyball group behavior identification method based on multi-mode information fusion

Info

Publication number
CN111401174A
Authority
CN
China
Prior art keywords
features
module
information
target
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010154331.6A
Other languages
Chinese (zh)
Other versions
CN111401174B (en)
Inventor
毋立芳
付亨
简萌
徐得中
袁元
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN202010154331.6A priority Critical patent/CN111401174B/en
Publication of CN111401174A publication Critical patent/CN111401174A/en
Application granted granted Critical
Publication of CN111401174B publication Critical patent/CN111401174B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V 20/42 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

A volleyball group behavior recognition method based on multi-modal information fusion is applied to the field of computer vision group behavior recognition. Group behavior recognition has attracted attention because of its wide application in sports analysis, automatic video surveillance systems, human-computer interaction, video recommendation systems, and the like. For group behavior recognition in a multi-person scene, modeling the relationships between targets and their motion patterns can provide discriminative visual cues. The invention introduces the relationships between image targets and the motion pattern as multi-modal information and then effectively encodes this information and performs global reasoning over it with the sequence model GRU. Finally, based on an attention mechanism, the information produced by the reasoning module is integrated from the time-domain perspective to obtain the final result. The method realizes group behavior recognition on the Volleyball dataset, verifies its feasibility through testing, and has important application value.

Description

Volleyball group behavior identification method based on multi-mode information fusion
Technical Field
The invention is applied to the field of computer vision group behavior recognition, and in particular relates to digital image processing and deep learning techniques such as optical flow feature extraction, appearance feature extraction, recurrent neural networks and attention mechanisms. The method takes broadcast volleyball sports video as the input images, extracts apparent features, motion pattern features and relation features of the target images through deep models, then performs feature fusion with a recurrent neural network and an attention mechanism, and combines the multi-modal information to realize the behavior recognition task for a multi-person group.
Background
Group behavior recognition is a comprehensive analysis task that has attracted attention because of its wide application in intelligent sports analysis, automatic video surveillance, human-computer interaction and video recommendation systems. For a computer to intelligently understand the behavior occurring in a multi-person scene, the designed model needs not only to describe the individual behavior of each target in the scene but also to infer their group behavior. The ability to accurately capture the corresponding relationships between people and to perform relational reasoning is crucial to understanding multi-person group behavior. However, modeling relationships between people is challenging, because existing work typically focuses on individual and group behaviors and does not take full advantage of potential interaction information. It is therefore desirable to infer the relationships between the participating target persons from their apparent features, relative positions and motion pattern information. Consequently, when designing an effective deep model for group behavior understanding, these important cues need to be integrated to perform inference.
Disclosure of Invention
In order to realize volleyball group behavior recognition, a group behavior recognition scheme based on multi-modal information fusion is provided; the flow of the method is shown in fig. 1. The method takes volleyball match video images as input, each module performs feature extraction and feature analysis on the video images, and the system finally outputs the recognized volleyball group behavior category. Specifically, the method first selects a portion of the images before and after the key frame in a volleyball broadcast video sequence according to the group behavior label, where the image material comes from the public Volleyball dataset. It then extracts the apparent features of each target individual (player) with a trained deep convolutional neural network model according to the position annotation of each individual in the target image, where the individual position annotations are also provided by the dataset. Next, an optical flow extraction network model computes the optical flow between adjacent frames to obtain an optical flow map, which is quantized and fed into a trained deep network to obtain a motion pattern expression feature for the image scene. The relation features are then modeled and expressed with an attention mechanism, based on the geometric information extracted from the rectangular box coordinates of each target individual and the apparent features of each target. The recurrent neural network sequence model GRU then effectively encodes the multi-modal information, performs global reasoning and fuses the features. Finally, based on an attention mechanism, the information produced by the reasoning module is integrated from the time-domain perspective and the final recognition result is obtained. The overall framework of the method is shown in fig. 2, and the following modules are mainly designed and applied: an apparent feature extraction module, a relation feature extraction module, a motion pattern feature extraction module, a global reasoning module and a time domain fusion module. Through the cooperative processing of these modules, the effective multi-modal information contained in the video images is extracted and combined, thereby realizing volleyball match group activity recognition.
The invention contents of each main module of the method are as follows:
1. apparent feature extraction module
The first module is an apparent feature extraction module, which functions to extract the apparent features of each target individual in the image as a kind of multi-modal information. This module extracts the apparent features of each target individual (player) by using a trained deep convolutional neural network model based on the position labeling information of each individual in the target image. The image appearance feature is a feature which is abstractly extracted from image RGB information distribution based on a convolutional neural network and is used for expressing image semantic information. As an important component of multimodal information, apparent features play an important role in identifying group behavior.
Firstly, a trained deep convolutional neural network model is used to extract full-image features from the volleyball video image. Then the RoI-Align mechanism from the Mask R-CNN algorithm is applied to map the candidate box (bounding box) of each participating target individual (actor) onto the full-image features, completing the feature extraction for each target individual. Finally, a fully connected layer aligns the feature vectors, and a D-dimensional apparent feature vector is obtained for each target individual.
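For illustration, a minimal PyTorch-style sketch of this step is given below. The backbone choice, RoI-Align output size and all function and variable names are assumptions made for exposition, not the exact configuration of the patented system.

```python
import torch
import torchvision
from torchvision.ops import roi_align

# Assumed backbone: ResNet-50 truncated before the classification head.
backbone = torch.nn.Sequential(
    *list(torchvision.models.resnet50(weights=None).children())[:-2]
)
fc_align = torch.nn.Linear(2048 * 5 * 5, 1024)   # D = 1024 (RoI size 5x5 assumed)

def extract_appearance_features(frame, boxes):
    """frame: (1, 3, H, W) RGB tensor; boxes: (N, 4) player boxes as (x1, y1, x2, y2)."""
    feat_map = backbone(frame)                        # (1, 2048, H/32, W/32)
    scale = feat_map.shape[-1] / frame.shape[-1]      # map pixel boxes onto the feature map
    rois = torch.cat([torch.zeros(len(boxes), 1), boxes], dim=1)  # prepend batch index 0
    crops = roi_align(feat_map, rois, output_size=(5, 5), spatial_scale=scale)
    return fc_align(crops.flatten(1))                 # (N, 1024) per-actor apparent features
```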
2. Relational feature extraction module
The second module is the relation feature extraction module, whose function is to extract the relation features of each target individual in the image as information of a new modality. Firstly, geometric information features are extracted from the geometric coordinates of each target's rectangular box using the bounding box regression target formula; then the relation modeling method of the Relation Network algorithm model is applied to the extracted geometric position information to perform relational modeling and feature expression over the geometric and apparent information. The inter-target relation features are extracted through a series of non-linear transformations and an attention mechanism based on the size relationship and geometric position relationship between targets. As an important component of the multi-modal information, the relation features enhance the features by being embedded with the apparent features.
Firstly, the geometric features between any two targets in the image are embedded into a K-dimensional high-dimensional space (K is the high-dimensional space dimension coefficient) for expression, based on the bounding box regression target formula; the geometric position labels of the target individuals are provided by the public Volleyball dataset. The K-dimensional geometric information is then combined with the apparent feature information, and a series of non-linear transformations is performed through trained weights.
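As a hedged sketch of the geometric side of this module (the exact embedding used in the method is described with formula (1) in the detailed description), the pairwise box geometry could be computed as follows; the clamping constant and all names are illustrative assumptions.

```python
import torch

def pairwise_geometry(boxes):
    """boxes: (N, 4) tensor of (x, y, w, h); returns (N, N, 4) relative geometry
    following a bounding-box-regression-style encoding (Relation Network style)."""
    x, y, w, h = boxes.unbind(dim=1)
    dx = torch.log(torch.abs(x[:, None] - x[None, :]).clamp(min=1e-3) / w[:, None])
    dy = torch.log(torch.abs(y[:, None] - y[None, :]).clamp(min=1e-3) / h[:, None])
    dw = torch.log(w[None, :] / w[:, None])
    dh = torch.log(h[None, :] / h[:, None])
    return torch.stack([dx, dy, dw, dh], dim=-1)     # later embedded into K = 64 dims
```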
3. Motion pattern feature extraction module
The third module is the motion pattern feature extraction module, whose function is to extract the motion pattern features of the image as information of a new modality. The quantized optical flow map of the target image is fed into a trained residual network classification model, and the resulting features are the feature expression of the motion pattern of the whole image scene. The image motion pattern features are abstractly extracted from the temporal changes of the images; they express the motion information of the target images and the motion relations between targets, and are another important component of the multi-modal information.
Firstly, the optical flow extraction network PWC-Net is used to extract the optical flow map of the selected adjacent video images, yielding an optical flow image that expresses the image motion. The optical flow map is then quantized: the values describing the degree of pixel motion are mapped into a color space in the range 0-255, giving a quantized optical flow map. Finally, the quantized optical flow map is fed into a trained deep classification model to obtain the motion pattern expression features of the image scene, and a feature vector of dimension N × D is output. The flow of this module is shown in fig. 5.
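A minimal sketch of the quantization step described above (clipping flow values and mapping them to [0, 255]) might look as follows; the clipping range follows the values given later in the detailed description, and the function name and defaults are assumptions.

```python
import numpy as np

def quantize_flow(flow, o_min=-20.0, o_max=20.0, n_levels=256):
    """flow: (H, W, 2) optical flow from PWC-Net; returns a uint8 two-channel map."""
    clipped = np.clip(flow, o_min, o_max)                  # filter out extreme motion values
    scaled = (clipped - o_min) / (o_max - o_min) * (n_levels - 1)
    return scaled.astype(np.uint8)                         # values in [0, 255]
```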
4. Global reasoning module
The fourth module is a global reasoning module, and the function of the fourth module is to integrate the multi-modal feature information extracted by the modules. And sending the multi-mode information into a trained recurrent neural network sequence model GRU, realizing effective coding and global reasoning of the information, and fusing the apparent characteristics, the relation characteristics and the image motion mode characteristics of the target individual.
In order to fuse multiple features to facilitate group behavior understanding, a GRU model is introduced and modeled over the pairwise relation information. As an effective memory unit, the GRU can remember long-term information, and the GRU cell can choose to ignore portions of the target state that are irrelevant to the current motion expression, or use the multi-modal information to enhance portions of the target state.
A group of feature fusion modules, Opt-GRU and Relation-GRU, is provided in the method for encoding the different features to transmit messages, thereby realizing global reasoning over the semantic information. Firstly, the multi-modal information is gathered: the apparent features fA, relation features fR and motion pattern features fopt are vertically concatenated and reshaped along the N channel to conform to the GRU input format. Then, the apparent feature fA is used as the hidden-unit input of the two GRUs for relational reasoning; the multi-modal features fR and fopt output by the relation feature extraction module and the motion pattern feature extraction module are input to the Relation-GRU and the Opt-GRU respectively, and the feature vectors output by the two GRUs are fused using an average pooling operation. Finally, a max pooling operation is applied to obtain the aggregated frame-level global information features, so that a global reasoning feature of dimension D is obtained for each frame of the video. The specific flow of this module is shown in fig. 6.
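The following PyTorch-style sketch illustrates the Opt-GRU / Relation-GRU fusion described above, with the apparent feature used as the initial hidden state; the class name, dimensions and pooling order are assumptions made for exposition.

```python
import torch
import torch.nn as nn

class GlobalReasoning(nn.Module):
    """Sketch: fuse relation and motion features with two GRU cells (D = 1024 assumed)."""
    def __init__(self, dim=1024):
        super().__init__()
        self.opt_gru = nn.GRUCell(dim, dim)       # Opt-GRU
        self.rel_gru = nn.GRUCell(dim, dim)       # Relation-GRU

    def forward(self, f_a, f_r, f_opt):
        # f_a, f_r, f_opt: (N, D) per-actor features for one frame
        h_opt = self.opt_gru(f_opt, f_a)          # apparent feature as initial hidden state
        h_rel = self.rel_gru(f_r, f_a)
        h = (h_opt + h_rel) / 2                   # average pooling of the two GRU outputs
        return h.max(dim=0).values                # max pool over actors -> frame-level (D,)
```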
5. Time domain fusion module
The fifth module is the time domain fusion module, whose function is to fuse the features of each video frame in the time domain. Through an attention mechanism algorithm, the module integrates the information obtained by the global reasoning module from the time-domain perspective and outputs the final recognition result.
The modules above yield the multi-modal features of a given frame in the video; however, for a temporally ordered video, the time-domain information is also very important. Since each frame of the video contributes differently to the whole event in the time domain, the invention exploits the semantic information of the frames and further integrates the frame-level features with the time-domain information to form sequence-level features.
The invention feeds all the global features obtained under the same group event into an attention layer; according to the self-attention parameter settings, the frame-level features are fused by dimension reduction into sequence-level features. Finally, the fused features are sent into a trained classification network layer (Softmax layer), and the recognition result of the volleyball group behavior is output. The flow of this module is shown in FIG. 7.
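A hedged sketch of this attention-based temporal fusion is shown below; the attention parameterization (a single linear scoring layer) and the class name are assumptions consistent with formulas (11) and (12) of the detailed description, not necessarily the exact implementation.

```python
import torch
import torch.nn as nn

class TemporalAttentionFusion(nn.Module):
    """Sketch: attention-weighted fusion of frame-level features into a sequence-level
    feature, followed by a softmax classifier (8 volleyball group activity classes)."""
    def __init__(self, dim=1024, num_classes=8):
        super().__init__()
        self.att = nn.Linear(dim, 1)
        self.cls = nn.Linear(dim, num_classes)

    def forward(self, frame_feats):                          # (T, D) frame-level features
        scores = torch.tanh(self.att(frame_feats))           # (T, 1) per-frame scores
        weights = torch.softmax(scores, dim=0)               # attention over frames
        seq_feat = (weights * frame_feats).sum(dim=0)        # (D,) sequence-level feature
        return self.cls(seq_feat)                            # class logits
```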
Through the effective collocation of the modules, the group behavior recognition task of the volleyball match video is completed together. The selected volleyball match video images and the individual marking frames thereof are used as input, and the apparent feature extraction module is used for extracting individual features of the selected volleyball match video images and outputting the apparent features of each individual of each image; the relationship feature extraction module takes the individual apparent features and the individual rectangular boxes as input and outputs relationship features for expressing the interaction relationship between individuals; the motion mode feature extraction module takes the video image as input and outputs motion mode features for expressing the global motion state of the image; and then, carrying out feature fusion on the individual apparent features, the relationship features and the motion mode features in sequence through a global reasoning module and a time domain fusion module, analyzing by combining the fused features, and finally outputting a volleyball group behavior recognition result.
Drawings
FIG. 1 is a flow chart of a volleyball group behavior identification scheme;
FIG. 2 is a general framework of a volleyball group behavior recognition scheme;
FIG. 3 is an apparent feature extraction module framework;
FIG. 4 is a relational feature extraction module framework;
FIG. 5 is a motion pattern feature extraction module framework;
FIG. 6 is a global reasoning module framework;
FIG. 7 is a time domain fusion module framework;
FIG. 8 is an example of a volleyball group behavior video frame annotation;
FIG. 9 is an example of an RGB original image of volleyball game video and its optical flow diagram;
fig. 10 is an exemplary diagram of the classification result of the behavior of the volleyball group.
Detailed Description
The following describes how the model of each module is trained and applied.
The invention provides a volleyball group behavior identification method based on multi-modal information fusion. Based on the annotations provided by the "Volleyball" dataset, the group events can be classified into the following 8 categories: left first pass (l_pass), left second pass (l_set), left spike (l_spike), left score (l_winpoint), right first pass (r_pass), right second pass (r_set), right spike (r_spike), and right score (r_winpoint).
The method comprises the following specific implementation steps:
1. apparent feature extraction module
The apparent features are an important component of the multi-modal features used to identify group behavior. The invention uses a ResNet-50 residual network model as the backbone network of this module, combined with the RoI-Align method to handle targets at different positions.
In the Volleyball dataset, each video sequence consists of 21 game video frames with player position labels, and a rectangular box annotation is provided for each player target. Only the key frame and the 5 frames before and 4 frames after it are used to train the network model, ten images in total. These serve as the source images for recognizing a volleyball group event.
In the process of training the deep network for extracting the apparent features, the resnet-50 is selected as the backbone network, so that the feature extraction effectiveness is ensured, and the calculation cost is reduced. After extracting multi-scale features from the target image, the backbone network integrates the position coordinate information of different target individuals (actors) by using a roi-align processing algorithm, so that the model obtains the apparent features of each player respectively. And finally, integrating the characteristics of each target individual (actor) by using a maximum pooling method, and classifying the integrated characteristics by using a softmax layer. In the training process, an emb _ features parameter of the backbone network is set to 2048, and the apparent feature size is set to 1024;
the data for training was partitioned with reference to the training, validation, test set given by the volleyball official, for 200 rounds of training, the learning rate was set to 0.00001,
in the process of extracting the features, a filling method is designed for extracting apparent features with the same dimension corresponding to the phenomenon that the number of the operators in individual image frames is inconsistent. That is, in images with the number of objects smaller than N, N is 12 in the Volleyball dataset, and the candidate frames with the largest long sides among the existing objects are sequentially copied and filled. And then, using the trained model to perform feature extraction on the model, and storing the model offline. And finally, the extraction of the apparent features with the dimension of 12 x 1024 in each picture is realized.
2. Relational feature extraction module
The relation features are used to represent the relationships between target individuals (actors), as a kind of multi-modal information that reinforces the apparent features. To construct an expression of the relations between actors, this part builds on and improves the basic Relation Network model for relation representation.
In the Volleyball dataset, each frame image contains the position coordinate information of every player target (actor), so the apparent feature $f_A$ of each player is obtained through the apparent feature extraction module. In this module, the coordinate information is converted into a high-order spatial expression through the bounding box regression target formula, defining the geometric feature $f_G$. The originally labeled 4-dimensional rectangular box (bounding box) of each target individual (actor) is embedded into a 64-dimensional space through the following formula (1), representing the geometric information between target boxes. Assuming N targets, the geometric relationship between the i-th and j-th targets is expressed as:

$$f_G(i,j)=\left(\log\frac{|x_i-x_j|}{w_i},\ \log\frac{|y_i-y_j|}{h_i},\ \log\frac{w_j}{w_i},\ \log\frac{h_j}{h_i}\right) \quad (1)$$

$f_G$ denotes the geometric feature; x, y, w and h denote the horizontal and vertical coordinates of the upper-left corner of the rectangular box and its width and height, respectively. The subscripts i and j denote the indices of the targets.

For each frame in the volleyball event video, the apparent features $f_A$ and geometric features $f_G$ of the N target individuals (actors) are obtained. The relation feature $f_R(i)$ of each target individual (actor) is computed as follows:

$$f_R(i)=\sum_{j} w_{ij}\cdot\left(W_V f_A^j\right) \quad (2)$$

The relation feature $f_R$ in formula (2) is a weighted sum of the apparent features of the target individuals (actors), where $W_V f_A^j$ is the apparent feature of the j-th target linearly transformed by the weight $W_V$; this weight is obtained by joint training with the subsequent modules. The relation weight $w_{ij}$ represents the influence between targets i and j and is expressed as follows:

$$w_{ij}=\frac{w^G_{ij}\cdot\exp\!\left(w^A_{ij}\right)}{\sum_{k} w^G_{ik}\cdot\exp\!\left(w^A_{ik}\right)} \quad (3)$$

The appearance weight $w^A_{ij}$ in formula (3) is computed by formula (4) and the geometric weight $w^G_{ij}$ by formula (5); $w^A_{ik}$ and $w^G_{ik}$ are computed in the same way as formulas (4) and (5). The subscripts i, j and k denote the i-th, j-th and k-th targets, and the denominator of formula (3) normalizes the weights of the j-th target over the k targets.

$$w^A_{ij}=\frac{\sum\left(\left(W_k f_A^i\right)\odot\left(W_q f_A^j\right)\right)}{\sqrt{d_k}} \quad (4)$$

$$w^G_{ij}=\max\!\left\{0,\ W_G\cdot\varepsilon_G\!\left(f_g^i,f_g^j\right)\right\} \quad (5)$$

In formula (4), $W_k$ and $W_q$ are the weight matrices that map the apparent features $f_A^i$ and $f_A^j$ into a subspace; these weights are obtained by joint training with the subsequent modules. ⊙ denotes element-wise multiplication, i.e., multiplication of the corresponding entries of the vectors, and $d_k$ denotes the feature size after projection. In formula (5), the function $\varepsilon_G$ denotes the embedding computed according to formula (1), $f_g$ denotes the four-dimensional coordinates of a rectangular box, and $W_G$ is a learned weight obtained by joint training with the subsequent modules.

In summary, the geometric features between every two target individuals (actors) are embedded into a 64-dimensional space for high-dimensional expression, giving geometric features $f_G$ of dimension N × K (N is the number of actors, K is the geometric feature size). The embedded features are converted into scalar weights through $W_G$ and a non-linear operation is then performed; the non-linear operation restricts the relations to targets having a certain geometric relationship. Finally, the relational expression of each actor is shaped into a relation feature $f_R$ of dimension D. N is set to 12, K to 64, $d_k$ to 64 and D to 1024, yielding a 12 × 1024 relation feature expression.

The geometric features $f_G$ are extracted in advance from the values of the target boxes and stored in an offline file to facilitate subsequent computation. The parameters $W_V$, $W_G$, $W_k$ and $W_q$ that need to be trained for extracting the relation feature $f_R$ are obtained by joint training with the global reasoning module and the time domain fusion module, without separate parameter training.
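To make the weight computation in formulas (2)-(5) concrete, a hedged PyTorch-style sketch follows; the sinusoidal geometric embedding, class name and layer names are assumptions consistent with the Relation Network formulation referenced above rather than the exact implementation.

```python
import math
import torch
import torch.nn as nn

class RelationModule(nn.Module):
    """Sketch of formulas (2)-(5): appearance + geometry attention over N actors."""
    def __init__(self, d=1024, d_k=64, k_geo=64):
        super().__init__()
        self.W_q = nn.Linear(d, d_k, bias=False)
        self.W_k = nn.Linear(d, d_k, bias=False)
        self.W_v = nn.Linear(d, d, bias=False)
        self.W_g = nn.Linear(k_geo, 1, bias=False)
        self.d_k, self.k_geo = d_k, k_geo

    def embed_geometry(self, geo):                    # geo: (N, N, 4) from formula (1)
        # Assumed sinusoidal embedding of each geometric term into k_geo dimensions.
        freq = torch.arange(self.k_geo // 8, dtype=geo.dtype)
        freq = 1000.0 ** (8.0 * freq / self.k_geo)
        angles = geo.unsqueeze(-1) * 100.0 / freq      # (N, N, 4, k_geo/8)
        emb = torch.cat([angles.sin(), angles.cos()], dim=-1)
        return emb.flatten(-2)                         # (N, N, k_geo)

    def forward(self, f_a, geo):
        w_a = self.W_k(f_a) @ self.W_q(f_a).t() / math.sqrt(self.d_k)     # formula (4)
        w_g = torch.relu(self.W_g(self.embed_geometry(geo))).squeeze(-1)  # formula (5)
        w = w_g * torch.exp(w_a)
        w = w / w.sum(dim=1, keepdim=True).clamp(min=1e-6)                # formula (3)
        return w @ self.W_v(f_a)                                          # formula (2): (N, D)
```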
3. Motion pattern feature extraction module
The motion pattern features are another important kind of multi-modal information for enhancing the apparent features. An example of the correspondence between a volleyball game video original image and its optical flow diagram is shown in fig. 9.
In this module, the optical flow extraction network PWC-Net, pre-trained on the UCF101 dataset, is first used to extract the corresponding optical flow maps from the volleyball video, and the output is stored. Each output optical flow image is computed from two adjacent frames; since 10 frames around the key frame are used for recognition, one additional frame is appended after the 10th frame so that the same number of optical flow maps is obtained for subsequent computation. Based on observation and statistics of the output optical flow maps, the motion values are filtered to the specified range [-20, 20], and motion information outside this range is quantized to -20 or 20 respectively, thereby filtering out noise. The values in the range [-20, 20] are then scaled proportionally and mapped into the color expression space [0, 255]; the calculation is shown in the following formula, where $V_o$ is the motion information of the optical flow map, $O_{min}$ is the minimum optical flow value -20, $O_{max}$ is the maximum optical flow value 20, and N takes the value 256.

$$V_{quant}=\frac{V_o-O_{min}}{O_{max}-O_{min}}\times\left(N-1\right)$$

The quantized optical flow map is then fed into the convolutional neural network ResNet-50, and the model is trained with a softmax classification network using the behavior recognition labels as the classification target. Unlike a traditional three-channel RGB image, the quantized optical flow image has two channels, so for the first convolutional layer the convolution kernel channel parameter must be changed from 3 to 2 to accept the optical flow map as input. Classification training is then performed with the Adam optimizer. Next, for each target individual (actor), the global motion pattern features are locally extracted one by one, the classification model is obtained and the output features are stored. Finally, 1024-dimensional motion pattern features are extracted, giving a 12 × 1024 feature vector. This feature is used in subsequent modules for global reasoning about the motion relations between targets.
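A brief sketch of adapting the first convolution of ResNet-50 to the two-channel quantized flow input, as described above; the function name and the optimizer hyperparameters in the usage example are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torchvision

def build_flow_classifier(num_classes=8):
    """Sketch: ResNet-50 classifier whose first conv accepts a 2-channel flow map."""
    model = torchvision.models.resnet50(weights=None)
    model.conv1 = nn.Conv2d(2, 64, kernel_size=7, stride=2, padding=3, bias=False)  # 3 -> 2 channels
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model

# Usage example: train on quantized optical-flow maps with Adam (hyperparameters assumed).
model = build_flow_classifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()
```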
4. Global reasoning module
This module performs feature fusion on the obtained actor-level features to produce frame-level features. For each target node, the key to the interaction is encoding the information transfer from the motion expression and the other nodes. The GRU is used as the core component of this module.
The GRU cell has two important components, a reset gate (reset) and an update gate (update), which are formulated as follows:
$$r=\sigma\!\left(U_r\cdot \mathrm{concat}\!\left(x,h_t\right)\right) \quad (6)$$

$$z=\sigma\!\left(U_z\cdot \mathrm{concat}\!\left(x,h_t\right)\right) \quad (7)$$

where σ is the sigmoid activation function, concat denotes the concatenation of two vectors, and $U_r$ and $U_z$ are learnable weight matrices obtained by joint training with the subsequent modules. $h_t$ is the previous hidden-layer state, and the input x has the same dimension as $h_t$. The activation unit $h_{t+1}$ is expressed as follows:

$$\tilde{h}=\tanh\!\left(U_x\cdot x+V\cdot\left(r\odot h_t\right)\right) \quad (8)$$

$$h_{t+1}=\left(1-z\right)\odot h_t+z\odot\tilde{h} \quad (9)$$

where tanh is the activation function, and $U_x$ and V are the weight matrices connecting the input and the previous hidden state to the candidate state, obtained by joint training with the subsequent modules. ⊙ denotes element-wise multiplication, i.e., multiplication of the corresponding entries of the vectors. In this expression, the memory cell allows the hidden state, through the reset gate, to discard any information found to be irrelevant to the input; on the other hand, through the update gate, it controls how much information is carried from the previous state into the current hidden state, allowing a more effective representation.

Optical flow-GRU (Opt-GRU for short) and Relation-GRU are proposed to encode the two features described above and transmit messages. Opt-GRU takes the apparent feature $f_A$ of each target individual (actor), treated as a node, as the initial hidden state and takes the motion pattern feature of the target individual (actor) as input; Relation-GRU likewise uses the apparent feature $f_A$ as the initial hidden state and takes the relation modal feature of the target individual (actor) as input.

The integrated expression $h_{t+1}$ of the features is then obtained. In this part, fusion is performed using average pooling:

$$h_{t+1}=\frac{1}{2}\left(h^{opt}_{t+1}+h^{rel}_{t+1}\right) \quad (10)$$

where $h^{opt}_{t+1}$ is the output of the Opt-GRU, $h^{rel}_{t+1}$ is the output of the Relation-GRU, and $h_{t+1}$ is the integrated vector fusing the outputs of the two GRUs. Finally, a max pooling operation is applied to obtain the aggregated frame-level global information features, so that a global reasoning feature of dimension 1024 is obtained for each frame of the video.

As in the relation feature extraction module, the parameters $U_r$, $U_z$, $U_x$ and V that need to be trained for the integrated expression $h_{t+1}$ are obtained by joint training with the time domain fusion module, without separate parameter training.
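The gate equations (6)-(9) can be written compactly as the following hedged sketch of a single GRU update; the weight shapes and function name are assumptions, with variable names following the formulas above.

```python
import torch

def gru_cell_step(x, h_t, U_r, U_z, U_x, V):
    """One GRU update following formulas (6)-(9).
    x, h_t: (D,) vectors; U_r, U_z: (D, 2D) matrices; U_x, V: (D, D) matrices."""
    xh = torch.cat([x, h_t], dim=-1)
    r = torch.sigmoid(xh @ U_r.t())                          # reset gate, formula (6)
    z = torch.sigmoid(xh @ U_z.t())                          # update gate, formula (7)
    h_tilde = torch.tanh(x @ U_x.t() + (r * h_t) @ V.t())    # candidate state, formula (8)
    return (1 - z) * h_t + z * h_tilde                       # new hidden state, formula (9)
```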
5. Time domain fusion module
First, a set of frame features is given as the node features, $h=\{h_1,h_2,\dots,h_n\}$, where n is the number of nodes. In order to obtain sufficient expressive power to convert the input features into high-level features $h'$, a linear transformation with trainable parameters is needed. A shared linear transformation with weight matrix W is applied to each node:

$$a_i=\mathrm{softmax}\!\left(\tanh\!\left(W h_i\right)\right) \quad (11)$$

$$h'=\sum_{i} a_i\, h_i \quad (12)$$

where $a_i$ is the attention distribution coefficient and W is a learned weight obtained through training. $h_i$ denotes a node feature, tanh is the activation function, and softmax denotes the normalized exponential function. $h'$ denotes the output high-level feature.
And then applying a softmax classification network for final classification. The classification of the whole model is trained by using a standard cross-entropy loss function (cross-entropy loss), and finally the recognition task of volleyball group behaviors is realized.
The relation feature extraction module, global reasoning module and time domain fusion module are modeled and trained jointly, with the volleyball group behavior labels used as supervision for learning the weight parameters. The training process uses the Adam optimizer for 100 rounds with the learning rate set to 0.001. Under these parameter settings, the model converges and reaches its best recognition accuracy of 93% at the 45th training round.
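A hedged end-to-end training sketch under the stated settings (Adam, learning rate 0.001, cross-entropy supervision of the 8 group activity labels) is given below; it relies on the illustrative RelationModule, GlobalReasoning and TemporalAttentionFusion sketches defined earlier, which are assumptions rather than the patented implementation, and assumes appearance and flow features have been precomputed offline.

```python
import torch
import torch.nn as nn

relation = RelationModule()
reasoning = GlobalReasoning()
temporal = TemporalAttentionFusion()

params = list(relation.parameters()) + list(reasoning.parameters()) + list(temporal.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)
criterion = nn.CrossEntropyLoss()

def train_step(f_a_seq, geo_seq, f_opt_seq, label):
    """f_a_seq, f_opt_seq: (T, N, D); geo_seq: (T, N, N, 4); label: scalar class index."""
    frame_feats = []
    for f_a, geo, f_opt in zip(f_a_seq, geo_seq, f_opt_seq):
        f_r = relation(f_a, geo)                      # per-actor relation features
        frame_feats.append(reasoning(f_a, f_r, f_opt))  # frame-level global feature
    logits = temporal(torch.stack(frame_feats))       # sequence-level prediction
    loss = criterion(logits.unsqueeze(0), label.unsqueeze(0))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```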

Claims (3)

1. A volleyball group behavior recognition method based on multi-mode information fusion is characterized in that the following modules are designed and applied: the system comprises an apparent feature extraction module, a relation feature extraction module, a motion mode feature extraction module, a global reasoning module and a time domain fusion module;
the selected volleyball match video images and the individual marking frames thereof are used as input, and the apparent feature extraction module is used for extracting individual features of the selected volleyball match video images and outputting the apparent features of each individual of each image; the relationship feature extraction module takes the individual apparent features and the individual rectangular boxes as input and outputs relationship features for expressing the interaction relationship between individuals; the motion mode feature extraction module takes the video image as input and outputs motion mode features for expressing the global motion state of the image; and then, carrying out feature fusion on the individual apparent features, the relationship features and the motion mode features in sequence through a global reasoning module and a time domain fusion module, analyzing by combining the fused features, and finally outputting a volleyball group behavior recognition result.
2. The method of claim 1, wherein the contents of each module are as follows:
1) an apparent feature extraction module
The first module is an apparent feature extraction module which extracts the apparent features of each target individual in the image as multi-modal information; the module extracts the apparent characteristics of each target individual, namely the player, by using a trained deep convolutional neural network model according to the position marking information of each individual in the target image; the individual apparent features are features which are abstractly extracted from image RGB information distribution based on a convolutional neural network and are used for expressing image semantic information;
firstly, extracting a full map feature from volleyball video images by using a trained deep convolution neural network model, and then processing the corresponding relation between a candidate frame (bounding box) of each participating target (operator) and the full map feature by applying a RoI-Align mechanism in a Mask-RCNN algorithm model so as to complete feature extraction of each target individual; then, carrying out vector alignment on the features by using a full connection layer, and obtaining a D-dimensional apparent feature vector of each target individual through the full connection layer;
the number of targets in a certain frame of the video is N, and a matrix with N × D dimension is used for representing the feature vectors of all targets, wherein N is the number of the targets, and D is the size of the relational feature;
2) relational feature extraction module
The second module is a relational feature extraction module which extracts the relational features of each target individual in the image as information of a new mode; firstly, extracting geometric information characteristics from geometric coordinates of each target rectangular frame in an image by using a bounding box target regression (bounding box regression target) formula, and then carrying out relational modeling and characteristic expression on geometric information and apparent information by using a relational modeling method in a relationship Network algorithm model on the extracted geometric position information; extracting the characteristics of the relation between the targets through a series of nonlinear transformation and an attention mechanism based on the size relation and the geometric position relation between the targets;
firstly, embedding geometric features between any two targets in an image into a K-dimensional high-dimensional space for expression based on a bounding box target regression formula, wherein the geometric position labels of target individuals are provided by a public data set 'Volleyball'; then combining the geometric information of the high-dimensional expression with the apparent characteristic information, and executing a series of nonlinear transformation through the operation of weight training; outputting the relation expression between every two targets into a feature vector of a D dimension;
3) motion pattern feature extraction module
The third module is a motion mode feature extraction module which extracts the motion mode features of the image as information of a new mode; sending the optical flow quantization graph of the target image into a trained residual error network classification model, wherein the obtained characteristics are characteristic vectors expressing the motion mode of the whole image scene;
firstly, extracting an optical flow graph from a selected adjacent video image by using an optical flow extraction network PWC-NET to obtain an optical flow image for expressing image motion; then, carrying out quantization processing on the light flow graph, and mapping the numerical value used for expressing the pixel motion degree to a color space in a range of 0-255 to obtain a quantized light flow graph; finally, the quantized optical flow graph is sent into a trained depth classification model, and the motion mode expression characteristics of the image scene are obtained;
4) global reasoning module
The fourth module is a global reasoning module which has the function of integrating the multi-modal characteristic information extracted by the modules; sending the multi-modal information into a trained recurrent neural network sequence model GRU, realizing effective coding and global reasoning of the information, and fusing individual apparent characteristics, relationship characteristics and image motion mode characteristics;
a group of feature fusion modules of Optical flow-GRU (Opt-GRU for short) and R is providedThe evolution-GRU is used for coding different characteristics to transmit messages, so that the function of semantic information global reasoning is realized; first, the multi-modal information is summarized, and the apparent feature f isARelation characteristic fRAnd a motion pattern characteristic fOPerforming vertical splicing deformation to meet the input format of the GRU; then, the apparent feature f is usedAThe hidden unit input of the two GRU modules is used for relationship reasoning, and the multi-modal feature information respectively output by the relationship feature extraction module and the motion mode feature extraction module is input into the relationship-GRU and the Opt-GRU respectively, and feature vectors output by the two GRUs are fused by using average pooling operation; finally, maximum value pooling operation is needed to obtain the global information characteristics of the frame level (frame-level) which is aggregated and sorted; obtaining a global reasoning characteristic with dimension D for each frame of image in the video;
5) time domain fusion module
The fifth module is a time domain fusion module which fuses the features of each video frame from the time-domain perspective; the module integrates the information obtained by the global reasoning module from the time-domain perspective through an attention mechanism algorithm and outputs the final recognition result;
The selected volleyball video images are fed in sequentially, their apparent features, relation features and motion pattern features are extracted respectively, and the global reasoning features are obtained in the global reasoning module using the GRU model; for one volleyball group event, frame-level global features are obtained from every frame of the video, all the global features obtained under the same group event are input into an attention layer, the frame-level features are fused by dimension reduction into sequence-level features according to the self-attention parameter settings, the fused features are finally sent into a trained classification network layer (Softmax layer), and the behavior recognition result of the volleyball group is output.
3. The method according to claim 1, characterized by the following steps:
based on the annotations provided by the "Volleyball" dataset, the population events are classified into the following 8 categories: first left pass (l _ pass), second left pass (l _ set), left ball catch (l _ spike), left score (l _ winpoint), first right pass (r _ pass), second right pass (r _ set), right ball catch (r _ spike) and right score (r _ winpoint);
1) an apparent feature extraction module
In the Volleyball dataset, each video sequence consists of 21 match video frames with player position annotations, and a rectangular box annotation of each player target is provided in the dataset; only the key frame and the 5 frames before and 4 frames after it are used when training the network model, ten images in total, which serve as the source images for recognizing a volleyball group event;
in the process of training the deep network for extracting the apparent features, the resnet-50 is selected as a backbone network, so that the feature extraction effectiveness is ensured, and the calculation cost is reduced; after extracting multi-scale features from the target image, the backbone network integrates the position coordinate information of different target individuals by using a roi-align processing algorithm, so that the model obtains the apparent features of each player respectively; finally, integrating the characteristics of each target individual by using a maximum pooling method, and classifying the integrated characteristics by using a softmax layer; in the training process, an emb _ features parameter of the backbone network is set to 2048, and the apparent feature size is set to 1024;
the data for training are divided according to training, verifying and testing sets given by the volleyball official, 200 training rounds are performed in total, and the learning rate is set to be 0.00001;
in the process of extracting the features, a filling method is designed corresponding to the phenomenon that the number of the operators in individual image frames is inconsistent, and the filling method is used for extracting apparent features with the same dimension; in the images with the number of the targets less than N, N is 12 in the Volleyball data set, and candidate frames with the largest long edges in the existing targets are sequentially copied and filled; then, using the trained model to extract the characteristics of the model, and storing the model offline; the extraction of the apparent features of 12 x 1024 dimensions in each picture is realized;
2) relational feature extraction module
In the Volleyball dataset, each frame image contains the position coordinate information of each player target (actor), so the apparent feature $f_A$ of each player is obtained through the apparent feature extraction module; in this module, the coordinate information is converted into a high-order spatial expression through the bounding box regression target formula, defining the geometric feature $f_G$; the originally labeled 4-dimensional rectangular box (bounding box) information of each target individual is embedded into a 64-dimensional high-dimensional space through the following formula (1) to represent the geometric information between target boxes; assuming N targets, the geometric relationship between the i-th and j-th targets is expressed as:

$$f_G(i,j)=\left(\log\frac{|x_i-x_j|}{w_i},\ \log\frac{|y_i-y_j|}{h_i},\ \log\frac{w_j}{w_i},\ \log\frac{h_j}{h_i}\right) \quad (1)$$

$f_G$ denotes the geometric feature, and x, y, w and h denote the horizontal and vertical coordinates of the upper-left corner of the rectangular box and its width and height, respectively; the subscripts i and j denote the indices of the targets;

for each frame in the volleyball event video, the apparent features $f_A$ and geometric features $f_G$ of the N target individuals are obtained; the relation feature $f_R(i)$ of each target individual is computed as follows:

$$f_R(i)=\sum_{j} w_{ij}\cdot\left(W_V f_A^j\right) \quad (2)$$

the relation feature $f_R$ in formula (2) is a weighted sum of the apparent features of the target individuals, where $W_V f_A^j$ is the apparent feature of the j-th target linearly transformed by the weight $W_V$, the weight being obtained by joint training with the subsequent modules; the relation weight $w_{ij}$ represents the influence between targets i and j and is expressed as follows:

$$w_{ij}=\frac{w^G_{ij}\cdot\exp\!\left(w^A_{ij}\right)}{\sum_{k} w^G_{ik}\cdot\exp\!\left(w^A_{ik}\right)} \quad (3)$$

the appearance weight $w^A_{ij}$ in formula (3) is calculated by formula (4) and the geometric weight $w^G_{ij}$ by formula (5), and $w^A_{ik}$ and $w^G_{ik}$ are calculated in the same way as formulas (4) and (5); the subscripts i, j and k denote the i-th, j-th and k-th targets, and the denominator of formula (3) normalizes the weights of the j-th target over the k targets;

$$w^A_{ij}=\frac{\sum\left(\left(W_k f_A^i\right)\odot\left(W_q f_A^j\right)\right)}{\sqrt{d_k}} \quad (4)$$

$$w^G_{ij}=\max\!\left\{0,\ W_G\cdot\varepsilon_G\!\left(f_g^i,f_g^j\right)\right\} \quad (5)$$

$W_k$ and $W_q$ in formula (4) are the weight matrices that map the apparent features $f_A^i$ and $f_A^j$ into a subspace, the weights being obtained by joint training with the subsequent modules; ⊙ in the formula denotes element-wise multiplication, i.e., multiplication of the corresponding entries of the vectors, and $d_k$ denotes the projected feature size; in formula (5), the function $\varepsilon_G$ denotes the calculation procedure of formula (1), $f_g$ denotes the four-dimensional coordinates of a rectangular box, and $W_G$ denotes a learned weight obtained by joint training with the subsequent modules;

in summary, the geometric features between every two target individuals are embedded into a 64-dimensional space for high-dimensional expression, giving geometric features $f_G$ of dimension N × K (N is the number of actors, K is the geometric feature size); the embedded features are converted into scalar weights through $W_G$ and a non-linear operation is then performed; the non-linear operation restricts the relations to targets having a certain geometric relationship; finally, the relational expression of each actor is shaped into a relation feature $f_R$ of dimension D; N is set to 12, K to 64, $d_k$ to 64 and D to 1024, and a 12 × 1024 relation feature expression is obtained;

the geometric features $f_G$ are extracted in advance from the values of the target boxes and stored in an offline file to facilitate subsequent computation, and the parameters $W_V$, $W_G$, $W_k$ and $W_q$ that need to be trained for extracting the relation feature $f_R$ are obtained by joint training with the global reasoning module and the time domain fusion module, without separate parameter training;
3) motion pattern feature extraction module
In this module, the optical flow extraction network PWC-Net, pre-trained on the UCF101 dataset, is first used to extract the corresponding optical flow maps from the volleyball video, and the output results are stored; each output optical flow image is calculated from two adjacent frames, 10 frames of images around the key frame are used for recognition, and accordingly one additional frame is appended after the 10th frame so that the same number of optical flow maps is obtained for subsequent calculation; based on observation and statistics of the output optical flow maps, the motion information values are filtered to the specified range [-20, 20], and motion information outside this range is quantized to -20 or 20 respectively, thereby achieving the purpose of filtering out noise information; the values in the range [-20, 20] are then scaled proportionally and mapped into the color expression space [0, 255], the calculation process being shown in the following formula, where $V_o$ is the motion information corresponding to the optical flow map, $O_{min}$ is the minimum optical flow value -20, $O_{max}$ is the maximum optical flow value 20, and N takes the value 256;

$$V_{quant}=\frac{V_o-O_{min}}{O_{max}-O_{min}}\times\left(N-1\right)$$
then, the quantized optical flow graph is sent to a convolutional neural network resnet50, and a model is trained by matching with a softmax classification network and taking behavior recognition as a classification result; different from the traditional three-channel RGB image, the quantized optical flow image is two channels, so that for the first layer of convolution layer, the channel parameter of a convolution kernel needs to be modified from 3 to 2 so as to be suitable for the input of the optical flow graph; then, performing classification training by using an adam optimizer in a matching manner; then, for each target individual, performing local extraction on the global motion mode features one by one to obtain a classification model and storing output features; finally, extracting the characteristics of the motion mode with the size of 1024 dimensions to obtain characteristic vectors with the dimensions of 12 x 1024; the feature is used for global inference of motion relation between targets in a subsequent module;
4) global reasoning module
The module performs feature fusion on the obtained actor-level features to obtain frame-level features; for each target node, the key to the interaction is to encode the information transfer from the motion expression and other nodes; a GRU is used as the core component of the module;
the GRU cell has two important components, a reset gate (reset) and an update gate (update), which are formulated as follows:
$$r=\sigma\!\left(U_r\cdot \mathrm{concat}\!\left(x,h_t\right)\right) \quad (6)$$

$$z=\sigma\!\left(U_z\cdot \mathrm{concat}\!\left(x,h_t\right)\right) \quad (7)$$

where σ is the sigmoid activation function, concat denotes the concatenation operation of two vectors, and $U_r$ and $U_z$ are learnable weight matrices, the weights being obtained by joint training with the subsequent modules; $h_t$ is the previous hidden-layer state, and the input x has the same dimension as $h_t$; the activation unit $h_{t+1}$ used is expressed as follows:

$$\tilde{h}=\tanh\!\left(U_x\cdot x+V\cdot\left(r\odot h_t\right)\right) \quad (8)$$

$$h_{t+1}=\left(1-z\right)\odot h_t+z\odot\tilde{h} \quad (9)$$

where tanh is the activation function, and $U_x$ and V respectively denote the weight matrices connecting the input and the previous hidden state to the candidate state, the weights being obtained by joint training with the subsequent modules; ⊙ denotes element-wise multiplication, i.e., multiplication of the corresponding entries of the vectors; in this expression, the memory unit (cell) allows the hidden state, through the reset gate, to remove any information found to be irrelevant to the input; on the other hand, the memory unit controls the quantity of information transmitted from the previous state to the current hidden state, thereby allowing a more effective expression through the update gate;

Optical flow-GRU (Opt-GRU for short) and Relation-GRU are proposed for encoding the above two features to transmit messages; Opt-GRU takes the apparent feature $f_A$ of each target individual, treated as a node, as the initial hidden state and takes the motion pattern feature of the target individual as input; Relation-GRU likewise uses the apparent feature $f_A$ as the initial hidden state and takes the relation modal feature of the target individual as input;

the integrated expression $h_{t+1}$ of the features is obtained; in this part, fusion is performed using average pooling:

$$h_{t+1}=\frac{1}{2}\left(h^{opt}_{t+1}+h^{rel}_{t+1}\right) \quad (10)$$

where $h^{opt}_{t+1}$ is the output of the Opt-GRU, $h^{rel}_{t+1}$ denotes the output of the Relation-GRU, and $h_{t+1}$ is the integration vector fusing the outputs of the two GRUs; finally, a max pooling operation is applied to obtain the aggregated frame-level global information features; a global reasoning feature of dimension 1024 is obtained for each frame of image in the video;

as in the relation feature extraction module, the parameters $U_r$, $U_z$, $U_x$ and V that need to be trained for the integrated expression $h_{t+1}$ are obtained by joint training with the time domain fusion module, without separate parameter training;
5) time domain fusion module
First, a set of frame features is given as the node features, $h=\{h_1,h_2,\dots,h_n\}$, where n is the number of nodes; in order to obtain sufficient expressive power to convert the input features into high-level features $h'$, a linear transformation with trainable parameters is needed; a shared linear transformation with weight matrix W is applied to each node:

$$a_i=\mathrm{softmax}\!\left(\tanh\!\left(W h_i\right)\right) \quad (11)$$

$$h'=\sum_{i} a_i\, h_i \quad (12)$$

where $a_i$ denotes the attention distribution coefficient and W denotes the learned weight, the weight being obtained through training; $h_i$ denotes a node feature, tanh is the activation function, and softmax denotes the normalized exponential function; $h'$ denotes the output high-level feature;
then applying a softmax classification network to carry out final classification; the classification of the whole model utilizes a standard cross entropy loss function (cross-entropy loss) to complete training, and finally the recognition task of volleyball group behaviors is realized;
the relational feature extraction module, the global reasoning module and the time domain fusion module are modeled and trained jointly, using the volleyball group behavior labels as supervision for learning the weight parameters.
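To make the final step concrete, a hedged sketch (not the patented implementation) of a softmax classifier trained with the standard cross-entropy loss on the fused high-level feature h'; the batch size, the number of group-behavior classes (8) and the Adam optimizer are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Assumed values for illustration: 1024-D fused feature h', 8 behavior classes, batch of 4.
num_classes = 8
classifier = nn.Linear(1024, num_classes)              # softmax classification network
criterion = nn.CrossEntropyLoss()                      # standard cross-entropy loss
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-4)

h_prime = torch.randn(4, 1024)                         # fused high-level features h'
labels = torch.randint(0, num_classes, (4,))           # group-behavior labels as supervision

logits = classifier(h_prime)                           # class scores (softmax is applied inside the loss)
loss = criterion(logits, labels)                       # cross-entropy against the labels

optimizer.zero_grad()
loss.backward()                                        # joint weight-parameter learning step
optimizer.step()
```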
CN202010154331.6A 2020-03-07 2020-03-07 Volleyball group behavior identification method based on multi-mode information fusion Active CN111401174B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010154331.6A CN111401174B (en) 2020-03-07 2020-03-07 Volleyball group behavior identification method based on multi-mode information fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010154331.6A CN111401174B (en) 2020-03-07 2020-03-07 Volleyball group behavior identification method based on multi-mode information fusion

Publications (2)

Publication Number Publication Date
CN111401174A true CN111401174A (en) 2020-07-10
CN111401174B CN111401174B (en) 2023-09-22

Family

ID=71430604

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010154331.6A Active CN111401174B (en) 2020-03-07 2020-03-07 Volleyball group behavior identification method based on multi-mode information fusion

Country Status (1)

Country Link
CN (1) CN111401174B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107194322A (en) * 2017-04-28 2017-09-22 南京邮电大学 A kind of behavior analysis method in video monitoring scene
CN108681712A (en) * 2018-05-17 2018-10-19 北京工业大学 A kind of Basketball Match Context event recognition methods of fusion domain knowledge and multistage depth characteristic
CN109241834A (en) * 2018-07-27 2019-01-18 中山大学 A kind of group behavior recognition methods of the insertion based on hidden variable
CN110348364A (en) * 2019-07-05 2019-10-18 北京工业大学 A kind of basketball video group behavior recognition methods that Unsupervised clustering is combined with time-space domain depth network

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
HENG FU et al.: "MF-SORT: Simple Online and Realtime Tracking with Motion Features" *
周培培; 丁庆海; 罗海波; 侯幸林: "Crowd abnormal behavior detection and localization in video surveillance" *
谭程午 et al.: "Group behavior recognition based on fused features" *

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112131944A (en) * 2020-08-20 2020-12-25 深圳大学 Video behavior identification method and system
CN112131943A (en) * 2020-08-20 2020-12-25 深圳大学 Video behavior identification method and system based on dual attention model
CN112131944B (en) * 2020-08-20 2023-10-17 深圳大学 Video behavior recognition method and system
CN112131943B (en) * 2020-08-20 2023-07-11 深圳大学 Dual-attention model-based video behavior recognition method and system
CN112528785A (en) * 2020-11-30 2021-03-19 联想(北京)有限公司 Information processing method and device
CN113065451B (en) * 2021-03-29 2022-08-09 四川翼飞视科技有限公司 Multi-mode fused action recognition device and method and storage medium
CN113065451A (en) * 2021-03-29 2021-07-02 四川翼飞视科技有限公司 Multi-mode fused action recognition device and method and storage medium
CN113297936A (en) * 2021-05-17 2021-08-24 北京工业大学 Volleyball group behavior identification method based on local graph convolution network
CN113297936B (en) * 2021-05-17 2024-05-28 北京工业大学 Volleyball group behavior identification method based on local graph convolution network
CN113239828A (en) * 2021-05-20 2021-08-10 清华大学深圳国际研究生院 Face recognition method and device based on TOF camera module
CN113836992B (en) * 2021-06-15 2023-07-25 腾讯科技(深圳)有限公司 Label identification method, label identification model training method, device and equipment
CN113836992A (en) * 2021-06-15 2021-12-24 腾讯科技(深圳)有限公司 Method for identifying label, method, device and equipment for training label identification model
US20230161000A1 (en) * 2021-11-24 2023-05-25 Smart Radar System, Inc. 4-Dimensional Radar Signal Processing Apparatus
CN114187546A (en) * 2021-12-01 2022-03-15 山东大学 Combined action recognition method and system
CN114863356B (en) * 2022-03-10 2023-02-03 西南交通大学 Group activity identification method and system based on residual aggregation graph network
CN114863356A (en) * 2022-03-10 2022-08-05 西南交通大学 Group activity identification method and system based on residual aggregation graph network
CN117576784A (en) * 2024-01-15 2024-02-20 吉林大学 Method and system for recognizing diver gesture by fusing event and RGB data
CN117576784B (en) * 2024-01-15 2024-03-26 吉林大学 Method and system for recognizing diver gesture by fusing event and RGB data

Also Published As

Publication number Publication date
CN111401174B (en) 2023-09-22

Similar Documents

Publication Publication Date Title
CN111401174B (en) Volleyball group behavior identification method based on multi-mode information fusion
Dai et al. Human action recognition using two-stream attention based LSTM networks
CN110472531B (en) Video processing method, device, electronic equipment and storage medium
CN107506740B (en) Human body behavior identification method based on three-dimensional convolutional neural network and transfer learning model
Srinivas et al. A taxonomy of deep convolutional neural nets for computer vision
Kae et al. Augmenting CRFs with Boltzmann machine shape priors for image labeling
CN110717431A (en) Fine-grained visual question and answer method combined with multi-view attention mechanism
CN112446476A (en) Neural network model compression method, device, storage medium and chip
CN109670576B (en) Multi-scale visual attention image description method
CN110348364B (en) Basketball video group behavior identification method combining unsupervised clustering and time-space domain depth network
CN111666919B (en) Object identification method and device, computer equipment and storage medium
Ren et al. Learning with weak supervision from physics and data-driven constraints
CN112668366B (en) Image recognition method, device, computer readable storage medium and chip
US20220383639A1 (en) System and Method for Group Activity Recognition in Images and Videos with Self-Attention Mechanisms
CN112036276A (en) Artificial intelligent video question-answering method
CN112712068B (en) Key point detection method and device, electronic equipment and storage medium
Huang et al. Vqabq: Visual question answering by basic questions
CN115223020B (en) Image processing method, apparatus, device, storage medium, and computer program product
CN113221663A (en) Real-time sign language intelligent identification method, device and system
Li et al. Modelling human body pose for action recognition using deep neural networks
CN113312973A (en) Method and system for extracting features of gesture recognition key points
CN112906520A (en) Gesture coding-based action recognition method and device
CN103500456A (en) Object tracking method and equipment based on dynamic Bayes model network
CN114821188A (en) Image processing method, training method of scene graph generation model and electronic equipment
Zhao et al. Research on human behavior recognition in video based on 3DCCA

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant