CN111401174A - Volleyball group behavior identification method based on multi-mode information fusion - Google Patents

Volleyball group behavior identification method based on multi-mode information fusion

Info

Publication number
CN111401174A
Authority
CN
China
Prior art keywords
features
module
information
target
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010154331.6A
Other languages
Chinese (zh)
Other versions
CN111401174B (en)
Inventor
毋立芳
付亨
简萌
徐得中
袁元
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN202010154331.6A priority Critical patent/CN111401174B/en
Publication of CN111401174A publication Critical patent/CN111401174A/en
Application granted granted Critical
Publication of CN111401174B publication Critical patent/CN111401174B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V 20/42 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

A volleyball group behavior recognition method based on multi-modal information fusion is applied to the field of computer vision group behavior recognition. Group behavior recognition has attracted attention because of its wide application in sports analysis, automatic video surveillance systems, human-computer interaction, video recommendation systems, and the like. For group behavior recognition in a multi-person scene, modeling the relationships between targets and their motion patterns can provide discriminative visual cues. The invention introduces the relationships between image targets and the motion pattern as multi-modal information and then effectively encodes this information and performs global reasoning over it with the sequence model GRU. Finally, based on an attention mechanism, the information produced by the reasoning module is integrated from the time-domain perspective to obtain the final result. The method realizes group behavior recognition on the Volleyball dataset, verifies its feasibility through testing, and has important application value.

Description

Volleyball group behavior identification method based on multi-mode information fusion
Technical Field
The invention is applied to the field of computer vision group behavior recognition, and in particular relates to digital image processing and deep learning techniques such as optical flow feature extraction, appearance feature extraction, recurrent neural networks and attention mechanisms. The method takes broadcast volleyball sports video as the input images, extracts apparent features, motion pattern features and relation features of the target images through deep models, then performs feature fusion with a recurrent neural network and an attention mechanism, and combines the multi-modal information to realize the behavior recognition task for a multi-person group.
Background
Group behavior recognition is a comprehensive analysis task that has attracted attention because of its wide application in intelligent sports analysis, automatic video surveillance, human-computer interaction and video recommendation systems. For a computer to intelligently understand the behavior occurring in a multi-person scene, the designed model needs not only to describe the individual behavior of each target in the scene but also to infer their group behavior. The ability to accurately capture the corresponding relationships between people and to perform relational reasoning is crucial to understanding multi-person group behavior. However, modeling relationships between people is challenging, because existing work typically focuses on individual and group behaviors and does not take full advantage of potential interaction information. It is therefore desirable to infer the relationships between the participating target persons from their apparent features, relative positions and motion pattern information. Consequently, when designing an effective deep model for group behavior understanding, these important cues need to be integrated to perform inference.
Disclosure of Invention
In order to realize volleyball group behavior recognition, a group behavior recognition scheme based on multi-modal information fusion is provided; the flow of the method is shown in fig. 1. The method takes volleyball match video images as input, each module performs feature extraction and feature analysis on the video images, and the system finally outputs the recognized volleyball group behavior category. Specifically, the method first selects a portion of the images before and after the key frame in a volleyball broadcast video sequence according to the group behavior label, where the image material comes from the public Volleyball dataset. It then extracts the apparent features of each target individual (player) with a trained deep convolutional neural network model according to the position annotation of each individual in the target image, where the individual position annotations are also provided by the dataset. Next, an optical flow extraction network model computes the optical flow between adjacent frames to obtain an optical flow map, which is quantized and fed into a trained deep network to obtain a motion pattern expression feature for the image scene. The relation features are then modeled and expressed with an attention mechanism, based on the geometric information extracted from the rectangular box coordinates of each target individual and the apparent features of each target. The recurrent neural network sequence model GRU then effectively encodes the multi-modal information, performs global reasoning and fuses the features. Finally, based on an attention mechanism, the information produced by the reasoning module is integrated from the time-domain perspective and the final recognition result is obtained. The overall framework of the method is shown in fig. 2, and the following modules are mainly designed and applied: an apparent feature extraction module, a relation feature extraction module, a motion pattern feature extraction module, a global reasoning module and a time domain fusion module. Through the cooperative processing of these modules, the effective multi-modal information contained in the video images is extracted and combined, thereby realizing volleyball match group activity recognition.
The invention contents of each main module of the method are as follows:
1. apparent feature extraction module
The first module is an apparent feature extraction module, which functions to extract the apparent features of each target individual in the image as a kind of multi-modal information. This module extracts the apparent features of each target individual (player) by using a trained deep convolutional neural network model based on the position labeling information of each individual in the target image. The image appearance feature is a feature which is abstractly extracted from image RGB information distribution based on a convolutional neural network and is used for expressing image semantic information. As an important component of multimodal information, apparent features play an important role in identifying group behavior.
Firstly, a trained deep convolutional neural network model is used to extract full-image features from the volleyball video image. Then the RoI-Align mechanism from the Mask R-CNN algorithm is applied to map the candidate box (bounding box) of each participating target individual (actor) onto the full-image features, completing the feature extraction for each target individual. Finally, a fully connected layer aligns the feature vectors, and a D-dimensional apparent feature vector is obtained for each target individual.
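For illustration, a minimal PyTorch-style sketch of this step is given below. The backbone choice, RoI-Align output size and all function and variable names are assumptions made for exposition, not the exact configuration of the patented system.

```python
import torch
import torchvision
from torchvision.ops import roi_align

# Assumed backbone: ResNet-50 truncated before the classification head.
backbone = torch.nn.Sequential(
    *list(torchvision.models.resnet50(weights=None).children())[:-2]
)
fc_align = torch.nn.Linear(2048 * 5 * 5, 1024)   # D = 1024 (RoI size 5x5 assumed)

def extract_appearance_features(frame, boxes):
    """frame: (1, 3, H, W) RGB tensor; boxes: (N, 4) player boxes as (x1, y1, x2, y2)."""
    feat_map = backbone(frame)                        # (1, 2048, H/32, W/32)
    scale = feat_map.shape[-1] / frame.shape[-1]      # map pixel boxes onto the feature map
    rois = torch.cat([torch.zeros(len(boxes), 1), boxes], dim=1)  # prepend batch index 0
    crops = roi_align(feat_map, rois, output_size=(5, 5), spatial_scale=scale)
    return fc_align(crops.flatten(1))                 # (N, 1024) per-actor apparent features
```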
2. Relational feature extraction module
The second module is the relation feature extraction module, whose function is to extract the relation features of each target individual in the image as information of a new modality. Firstly, geometric information features are extracted from the geometric coordinates of each target's rectangular box using the bounding box regression target formula; then the relation modeling method of the Relation Network algorithm model is applied to the extracted geometric position information to perform relational modeling and feature expression over the geometric and apparent information. The inter-target relation features are extracted through a series of non-linear transformations and an attention mechanism based on the size relationship and geometric position relationship between targets. As an important component of the multi-modal information, the relation features enhance the features by being embedded with the apparent features.
Firstly, the geometric features between any two targets in the image are embedded into a K-dimensional high-dimensional space (K is the high-dimensional space dimension coefficient) for expression, based on the bounding box regression target formula; the geometric position labels of the target individuals are provided by the public Volleyball dataset. The K-dimensional geometric information is then combined with the apparent feature information, and a series of non-linear transformations is performed through trained weights.
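As a hedged sketch of the geometric side of this module (the exact embedding used in the method is described with formula (1) in the detailed description), the pairwise box geometry could be computed as follows; the clamping constant and all names are illustrative assumptions.

```python
import torch

def pairwise_geometry(boxes):
    """boxes: (N, 4) tensor of (x, y, w, h); returns (N, N, 4) relative geometry
    following a bounding-box-regression-style encoding (Relation Network style)."""
    x, y, w, h = boxes.unbind(dim=1)
    dx = torch.log(torch.abs(x[:, None] - x[None, :]).clamp(min=1e-3) / w[:, None])
    dy = torch.log(torch.abs(y[:, None] - y[None, :]).clamp(min=1e-3) / h[:, None])
    dw = torch.log(w[None, :] / w[:, None])
    dh = torch.log(h[None, :] / h[:, None])
    return torch.stack([dx, dy, dw, dh], dim=-1)     # later embedded into K = 64 dims
```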
3. Motion pattern feature extraction module
The third module is the motion pattern feature extraction module, whose function is to extract the motion pattern features of the image as information of a new modality. The quantized optical flow map of the target image is fed into a trained residual network classification model, and the resulting features are the feature expression of the motion pattern of the whole image scene. The image motion pattern features are abstractly extracted from the temporal changes of the images; they express the motion information of the target images and the motion relations between targets, and are another important component of the multi-modal information.
Firstly, the optical flow extraction network PWC-Net is used to extract the optical flow map of the selected adjacent video images, yielding an optical flow image that expresses the image motion. The optical flow map is then quantized: the values describing the degree of pixel motion are mapped into a color space in the range 0-255, giving a quantized optical flow map. Finally, the quantized optical flow map is fed into a trained deep classification model to obtain the motion pattern expression features of the image scene, and a feature vector of dimension N × D is output. The flow of this module is shown in fig. 5.
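A minimal sketch of the quantization step described above (clipping flow values and mapping them to [0, 255]) might look as follows; the clipping range follows the values given later in the detailed description, and the function name and defaults are assumptions.

```python
import numpy as np

def quantize_flow(flow, o_min=-20.0, o_max=20.0, n_levels=256):
    """flow: (H, W, 2) optical flow from PWC-Net; returns a uint8 two-channel map."""
    clipped = np.clip(flow, o_min, o_max)                  # filter out extreme motion values
    scaled = (clipped - o_min) / (o_max - o_min) * (n_levels - 1)
    return scaled.astype(np.uint8)                         # values in [0, 255]
```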
4. Global reasoning module
The fourth module is a global reasoning module, and the function of the fourth module is to integrate the multi-modal feature information extracted by the modules. And sending the multi-mode information into a trained recurrent neural network sequence model GRU, realizing effective coding and global reasoning of the information, and fusing the apparent characteristics, the relation characteristics and the image motion mode characteristics of the target individual.
In order to fuse multiple features to facilitate group behavior understanding, a GRU model is introduced and modeled over the pairwise relation information. As an effective memory unit, the GRU can remember long-term information, and the GRU cell can choose to ignore portions of the target state that are irrelevant to the current motion expression, or use the multi-modal information to enhance portions of the target state.
A group of feature fusion modules, Opt-GRU and Relation-GRU, is provided in the method for encoding the different features to transmit messages, thereby realizing global reasoning over the semantic information. Firstly, the multi-modal information is gathered: the apparent features fA, relation features fR and motion pattern features fopt are vertically concatenated and reshaped along the N channel to conform to the GRU input format. Then, the apparent feature fA is used as the hidden-unit input of the two GRUs for relational reasoning; the multi-modal features fR and fopt output by the relation feature extraction module and the motion pattern feature extraction module are input to the Relation-GRU and the Opt-GRU respectively, and the feature vectors output by the two GRUs are fused using an average pooling operation. Finally, a max pooling operation is applied to obtain the aggregated frame-level global information features, so that a global reasoning feature of dimension D is obtained for each frame of the video. The specific flow of this module is shown in fig. 6.
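The following PyTorch-style sketch illustrates the Opt-GRU / Relation-GRU fusion described above, with the apparent feature used as the initial hidden state; the class name, dimensions and pooling order are assumptions made for exposition.

```python
import torch
import torch.nn as nn

class GlobalReasoning(nn.Module):
    """Sketch: fuse relation and motion features with two GRU cells (D = 1024 assumed)."""
    def __init__(self, dim=1024):
        super().__init__()
        self.opt_gru = nn.GRUCell(dim, dim)       # Opt-GRU
        self.rel_gru = nn.GRUCell(dim, dim)       # Relation-GRU

    def forward(self, f_a, f_r, f_opt):
        # f_a, f_r, f_opt: (N, D) per-actor features for one frame
        h_opt = self.opt_gru(f_opt, f_a)          # apparent feature as initial hidden state
        h_rel = self.rel_gru(f_r, f_a)
        h = (h_opt + h_rel) / 2                   # average pooling of the two GRU outputs
        return h.max(dim=0).values                # max pool over actors -> frame-level (D,)
```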
5. Time domain fusion module
The fifth module is the time domain fusion module, whose function is to fuse the features of each video frame in the time domain. Through an attention mechanism algorithm, the module integrates the information obtained by the global reasoning module from the time-domain perspective and outputs the final recognition result.
The modules above yield the multi-modal features of a given frame in the video; however, for a temporally ordered video, the time-domain information is also very important. Since each frame of the video contributes differently to the whole event in the time domain, the invention exploits the semantic information of the frames and further integrates the frame-level features with the time-domain information to form sequence-level features.
The invention feeds all the global features obtained under the same group event into an attention layer; according to the self-attention parameter settings, the frame-level features are fused by dimension reduction into sequence-level features. Finally, the fused features are sent into a trained classification network layer (Softmax layer), and the recognition result of the volleyball group behavior is output. The flow of this module is shown in FIG. 7.
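A hedged sketch of this attention-based temporal fusion is shown below; the attention parameterization (a single linear scoring layer) and the class name are assumptions consistent with formulas (11) and (12) of the detailed description, not necessarily the exact implementation.

```python
import torch
import torch.nn as nn

class TemporalAttentionFusion(nn.Module):
    """Sketch: attention-weighted fusion of frame-level features into a sequence-level
    feature, followed by a softmax classifier (8 volleyball group activity classes)."""
    def __init__(self, dim=1024, num_classes=8):
        super().__init__()
        self.att = nn.Linear(dim, 1)
        self.cls = nn.Linear(dim, num_classes)

    def forward(self, frame_feats):                          # (T, D) frame-level features
        scores = torch.tanh(self.att(frame_feats))           # (T, 1) per-frame scores
        weights = torch.softmax(scores, dim=0)               # attention over frames
        seq_feat = (weights * frame_feats).sum(dim=0)        # (D,) sequence-level feature
        return self.cls(seq_feat)                            # class logits
```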
Through the effective collocation of the modules, the group behavior recognition task of the volleyball match video is completed together. The selected volleyball match video images and the individual marking frames thereof are used as input, and the apparent feature extraction module is used for extracting individual features of the selected volleyball match video images and outputting the apparent features of each individual of each image; the relationship feature extraction module takes the individual apparent features and the individual rectangular boxes as input and outputs relationship features for expressing the interaction relationship between individuals; the motion mode feature extraction module takes the video image as input and outputs motion mode features for expressing the global motion state of the image; and then, carrying out feature fusion on the individual apparent features, the relationship features and the motion mode features in sequence through a global reasoning module and a time domain fusion module, analyzing by combining the fused features, and finally outputting a volleyball group behavior recognition result.
Drawings
FIG. 1 is a flow chart of a volleyball group behavior identification scheme;
FIG. 2 is a general framework of a volleyball group behavior recognition scheme;
FIG. 3 is an apparent feature extraction module framework;
FIG. 4 is a relational feature extraction module framework;
FIG. 5 is a motion pattern feature extraction module framework;
FIG. 6 is a global reasoning module framework;
FIG. 7 is a time domain fusion module framework;
FIG. 8 is an example of a volleyball group behavior video frame annotation;
FIG. 9 is an example of an RGB original image of volleyball game video and its optical flow diagram;
fig. 10 is an exemplary diagram of the classification result of the behavior of the volleyball group.
Detailed Description
The following describes how the model of each module is trained and applied.
The invention provides a volleyball group behavior identification method based on multi-modal information fusion. Based on the annotations provided by the "Volleyball" dataset, the group events can be classified into the following 8 categories: left first pass (l_pass), left second pass (l_set), left spike (l_spike), left score (l_winpoint), right first pass (r_pass), right second pass (r_set), right spike (r_spike), and right score (r_winpoint).
The method comprises the following specific implementation steps:
1. apparent feature extraction module
The apparent features are an important component of the multi-modal features used to identify group behavior. The invention uses a ResNet-50 residual network model as the backbone network of this module, combined with the RoI-Align method to handle targets at different positions.
In the Volleyball dataset, each video sequence consists of 21 game video frames with player position labels, and a rectangular box annotation is provided for each player target. Only the key frame and the 5 frames before and 4 frames after it are used to train the network model, ten images in total. These serve as the source images for recognizing a volleyball group event.
In the process of training the deep network for extracting the apparent features, the resnet-50 is selected as the backbone network, so that the feature extraction effectiveness is ensured, and the calculation cost is reduced. After extracting multi-scale features from the target image, the backbone network integrates the position coordinate information of different target individuals (actors) by using a roi-align processing algorithm, so that the model obtains the apparent features of each player respectively. And finally, integrating the characteristics of each target individual (actor) by using a maximum pooling method, and classifying the integrated characteristics by using a softmax layer. In the training process, an emb _ features parameter of the backbone network is set to 2048, and the apparent feature size is set to 1024;
the data for training was partitioned with reference to the training, validation, test set given by the volleyball official, for 200 rounds of training, the learning rate was set to 0.00001,
in the process of extracting the features, a filling method is designed for extracting apparent features with the same dimension corresponding to the phenomenon that the number of the operators in individual image frames is inconsistent. That is, in images with the number of objects smaller than N, N is 12 in the Volleyball dataset, and the candidate frames with the largest long sides among the existing objects are sequentially copied and filled. And then, using the trained model to perform feature extraction on the model, and storing the model offline. And finally, the extraction of the apparent features with the dimension of 12 x 1024 in each picture is realized.
2. Relational feature extraction module
The relation features are used to represent the relationships between target individuals (actors), as a kind of multi-modal information that reinforces the apparent features. To construct an expression of the relations between actors, this part builds on and improves the basic Relation Network model for relation representation.
In the Volleyball dataset, each frame image contains the position coordinate information of every player target (actor), so the apparent feature $f_A$ of each player is obtained through the apparent feature extraction module. In this module, the coordinate information is converted into a high-order spatial expression through the bounding box regression target formula, defining the geometric feature $f_G$. The originally labeled 4-dimensional rectangular box (bounding box) of each target individual (actor) is embedded into a 64-dimensional space through the following formula (1), representing the geometric information between target boxes. Assuming N targets, the geometric relationship between the i-th and j-th targets is expressed as:

$$f_G(i,j)=\left(\log\frac{|x_i-x_j|}{w_i},\ \log\frac{|y_i-y_j|}{h_i},\ \log\frac{w_j}{w_i},\ \log\frac{h_j}{h_i}\right) \quad (1)$$

$f_G$ denotes the geometric feature; x, y, w and h denote the horizontal and vertical coordinates of the upper-left corner of the rectangular box and its width and height, respectively. The subscripts i and j denote the indices of the targets.

For each frame in the volleyball event video, the apparent features $f_A$ and geometric features $f_G$ of the N target individuals (actors) are obtained. The relation feature $f_R(i)$ of each target individual (actor) is computed as follows:

$$f_R(i)=\sum_{j} w_{ij}\cdot\left(W_V f_A^j\right) \quad (2)$$

The relation feature $f_R$ in formula (2) is a weighted sum of the apparent features of the target individuals (actors), where $W_V f_A^j$ is the apparent feature of the j-th target linearly transformed by the weight $W_V$; this weight is obtained by joint training with the subsequent modules. The relation weight $w_{ij}$ represents the influence between targets i and j and is expressed as follows:

$$w_{ij}=\frac{w^G_{ij}\cdot\exp\!\left(w^A_{ij}\right)}{\sum_{k} w^G_{ik}\cdot\exp\!\left(w^A_{ik}\right)} \quad (3)$$

The appearance weight $w^A_{ij}$ in formula (3) is computed by formula (4) and the geometric weight $w^G_{ij}$ by formula (5); $w^A_{ik}$ and $w^G_{ik}$ are computed in the same way as formulas (4) and (5). The subscripts i, j and k denote the i-th, j-th and k-th targets, and the denominator of formula (3) normalizes the weights of the j-th target over the k targets.

$$w^A_{ij}=\frac{\sum\left(\left(W_k f_A^i\right)\odot\left(W_q f_A^j\right)\right)}{\sqrt{d_k}} \quad (4)$$

$$w^G_{ij}=\max\!\left\{0,\ W_G\cdot\varepsilon_G\!\left(f_g^i,f_g^j\right)\right\} \quad (5)$$

In formula (4), $W_k$ and $W_q$ are the weight matrices that map the apparent features $f_A^i$ and $f_A^j$ into a subspace; these weights are obtained by joint training with the subsequent modules. ⊙ denotes element-wise multiplication, i.e., multiplication of the corresponding entries of the vectors, and $d_k$ denotes the feature size after projection. In formula (5), the function $\varepsilon_G$ denotes the embedding computed according to formula (1), $f_g$ denotes the four-dimensional coordinates of a rectangular box, and $W_G$ is a learned weight obtained by joint training with the subsequent modules.

In summary, the geometric features between every two target individuals (actors) are embedded into a 64-dimensional space for high-dimensional expression, giving geometric features $f_G$ of dimension N × K (N is the number of actors, K is the geometric feature size). The embedded features are converted into scalar weights through $W_G$ and a non-linear operation is then performed; the non-linear operation restricts the relations to targets having a certain geometric relationship. Finally, the relational expression of each actor is shaped into a relation feature $f_R$ of dimension D. N is set to 12, K to 64, $d_k$ to 64 and D to 1024, yielding a 12 × 1024 relation feature expression.

The geometric features $f_G$ are extracted in advance from the values of the target boxes and stored in an offline file to facilitate subsequent computation. The parameters $W_V$, $W_G$, $W_k$ and $W_q$ that need to be trained for extracting the relation feature $f_R$ are obtained by joint training with the global reasoning module and the time domain fusion module, without separate parameter training.
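To make the weight computation in formulas (2)-(5) concrete, a hedged PyTorch-style sketch follows; the sinusoidal geometric embedding, class name and layer names are assumptions consistent with the Relation Network formulation referenced above rather than the exact implementation.

```python
import math
import torch
import torch.nn as nn

class RelationModule(nn.Module):
    """Sketch of formulas (2)-(5): appearance + geometry attention over N actors."""
    def __init__(self, d=1024, d_k=64, k_geo=64):
        super().__init__()
        self.W_q = nn.Linear(d, d_k, bias=False)
        self.W_k = nn.Linear(d, d_k, bias=False)
        self.W_v = nn.Linear(d, d, bias=False)
        self.W_g = nn.Linear(k_geo, 1, bias=False)
        self.d_k, self.k_geo = d_k, k_geo

    def embed_geometry(self, geo):                    # geo: (N, N, 4) from formula (1)
        # Assumed sinusoidal embedding of each geometric term into k_geo dimensions.
        freq = torch.arange(self.k_geo // 8, dtype=geo.dtype)
        freq = 1000.0 ** (8.0 * freq / self.k_geo)
        angles = geo.unsqueeze(-1) * 100.0 / freq      # (N, N, 4, k_geo/8)
        emb = torch.cat([angles.sin(), angles.cos()], dim=-1)
        return emb.flatten(-2)                         # (N, N, k_geo)

    def forward(self, f_a, geo):
        w_a = self.W_k(f_a) @ self.W_q(f_a).t() / math.sqrt(self.d_k)     # formula (4)
        w_g = torch.relu(self.W_g(self.embed_geometry(geo))).squeeze(-1)  # formula (5)
        w = w_g * torch.exp(w_a)
        w = w / w.sum(dim=1, keepdim=True).clamp(min=1e-6)                # formula (3)
        return w @ self.W_v(f_a)                                          # formula (2): (N, D)
```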
3. Motion pattern feature extraction module
The motion pattern features are another important kind of multi-modal information for enhancing the apparent features. An example of the correspondence between a volleyball game video original image and its optical flow diagram is shown in fig. 9.
In this module, the optical flow extraction network PWC-Net, pre-trained on the UCF101 dataset, is first used to extract the corresponding optical flow maps from the volleyball video, and the output is stored. Each output optical flow image is computed from two adjacent frames; since 10 frames around the key frame are used for recognition, one additional frame is appended after the 10th frame so that the same number of optical flow maps is obtained for subsequent computation. Based on observation and statistics of the output optical flow maps, the motion values are filtered to the specified range [-20, 20], and motion information outside this range is quantized to -20 or 20 respectively, thereby filtering out noise. The values in the range [-20, 20] are then scaled proportionally and mapped into the color expression space [0, 255]; the calculation is shown in the following formula, where $V_o$ is the motion information of the optical flow map, $O_{min}$ is the minimum optical flow value -20, $O_{max}$ is the maximum optical flow value 20, and N takes the value 256.

$$V_{quant}=\frac{V_o-O_{min}}{O_{max}-O_{min}}\times\left(N-1\right)$$

The quantized optical flow map is then fed into the convolutional neural network ResNet-50, and the model is trained with a softmax classification network using the behavior recognition labels as the classification target. Unlike a traditional three-channel RGB image, the quantized optical flow image has two channels, so for the first convolutional layer the convolution kernel channel parameter must be changed from 3 to 2 to accept the optical flow map as input. Classification training is then performed with the Adam optimizer. Next, for each target individual (actor), the global motion pattern features are locally extracted one by one, the classification model is obtained and the output features are stored. Finally, 1024-dimensional motion pattern features are extracted, giving a 12 × 1024 feature vector. This feature is used in subsequent modules for global reasoning about the motion relations between targets.
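A brief sketch of adapting the first convolution of ResNet-50 to the two-channel quantized flow input, as described above; the function name and the optimizer hyperparameters in the usage example are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torchvision

def build_flow_classifier(num_classes=8):
    """Sketch: ResNet-50 classifier whose first conv accepts a 2-channel flow map."""
    model = torchvision.models.resnet50(weights=None)
    model.conv1 = nn.Conv2d(2, 64, kernel_size=7, stride=2, padding=3, bias=False)  # 3 -> 2 channels
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model

# Usage example: train on quantized optical-flow maps with Adam (hyperparameters assumed).
model = build_flow_classifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()
```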
4. Global reasoning module
This module performs feature fusion on the obtained actor-level features to produce frame-level features. For each target node, the key to the interaction is encoding the information transfer from the motion expression and the other nodes. The GRU is used as the core component of this module.
The GRU cell has two important components, a reset gate (reset) and an update gate (update), which are formulated as follows:
$$r=\sigma\!\left(U_r\cdot \mathrm{concat}\!\left(x,h_t\right)\right) \quad (6)$$

$$z=\sigma\!\left(U_z\cdot \mathrm{concat}\!\left(x,h_t\right)\right) \quad (7)$$

where σ is the sigmoid activation function, concat denotes the concatenation of two vectors, and $U_r$ and $U_z$ are learnable weight matrices obtained by joint training with the subsequent modules. $h_t$ is the previous hidden-layer state, and the input x has the same dimension as $h_t$. The activation unit $h_{t+1}$ is expressed as follows:

$$\tilde{h}=\tanh\!\left(U_x\cdot x+V\cdot\left(r\odot h_t\right)\right) \quad (8)$$

$$h_{t+1}=\left(1-z\right)\odot h_t+z\odot\tilde{h} \quad (9)$$

where tanh is the activation function, and $U_x$ and V are the weight matrices connecting the input and the previous hidden state to the candidate state, obtained by joint training with the subsequent modules. ⊙ denotes element-wise multiplication, i.e., multiplication of the corresponding entries of the vectors. In this expression, the memory cell allows the hidden state, through the reset gate, to discard any information found to be irrelevant to the input; on the other hand, through the update gate, it controls how much information is carried from the previous state into the current hidden state, allowing a more effective representation.

Optical flow-GRU (Opt-GRU for short) and Relation-GRU are proposed to encode the two features described above and transmit messages. Opt-GRU takes the apparent feature $f_A$ of each target individual (actor), treated as a node, as the initial hidden state and takes the motion pattern feature of the target individual (actor) as input; Relation-GRU likewise uses the apparent feature $f_A$ as the initial hidden state and takes the relation modal feature of the target individual (actor) as input.

The integrated expression $h_{t+1}$ of the features is then obtained. In this part, fusion is performed using average pooling:

$$h_{t+1}=\frac{1}{2}\left(h^{opt}_{t+1}+h^{rel}_{t+1}\right) \quad (10)$$

where $h^{opt}_{t+1}$ is the output of the Opt-GRU, $h^{rel}_{t+1}$ is the output of the Relation-GRU, and $h_{t+1}$ is the integrated vector fusing the outputs of the two GRUs. Finally, a max pooling operation is applied to obtain the aggregated frame-level global information features, so that a global reasoning feature of dimension 1024 is obtained for each frame of the video.

As in the relation feature extraction module, the parameters $U_r$, $U_z$, $U_x$ and V that need to be trained for the integrated expression $h_{t+1}$ are obtained by joint training with the time domain fusion module, without separate parameter training.
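The gate equations (6)-(9) can be written compactly as the following hedged sketch of a single GRU update; the weight shapes and function name are assumptions, with variable names following the formulas above.

```python
import torch

def gru_cell_step(x, h_t, U_r, U_z, U_x, V):
    """One GRU update following formulas (6)-(9).
    x, h_t: (D,) vectors; U_r, U_z: (D, 2D) matrices; U_x, V: (D, D) matrices."""
    xh = torch.cat([x, h_t], dim=-1)
    r = torch.sigmoid(xh @ U_r.t())                          # reset gate, formula (6)
    z = torch.sigmoid(xh @ U_z.t())                          # update gate, formula (7)
    h_tilde = torch.tanh(x @ U_x.t() + (r * h_t) @ V.t())    # candidate state, formula (8)
    return (1 - z) * h_t + z * h_tilde                       # new hidden state, formula (9)
```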
5. Time domain fusion module
First, a set of frame features is given as the node features, $h=\{h_1,h_2,\dots,h_n\}$, where n is the number of nodes. In order to obtain sufficient expressive power to convert the input features into high-level features $h'$, a linear transformation with trainable parameters is needed. A shared linear transformation with weight matrix W is applied to each node:

$$a_i=\mathrm{softmax}\!\left(\tanh\!\left(W h_i\right)\right) \quad (11)$$

$$h'=\sum_{i} a_i\, h_i \quad (12)$$

where $a_i$ is the attention distribution coefficient and W is a learned weight obtained through training. $h_i$ denotes a node feature, tanh is the activation function, and softmax denotes the normalized exponential function. $h'$ denotes the output high-level feature.
And then applying a softmax classification network for final classification. The classification of the whole model is trained by using a standard cross-entropy loss function (cross-entropy loss), and finally the recognition task of volleyball group behaviors is realized.
The relation feature extraction module, global reasoning module and time domain fusion module are modeled and trained jointly, with the volleyball group behavior labels used as supervision for learning the weight parameters. The training process uses the Adam optimizer for 100 rounds with the learning rate set to 0.001. Under these parameter settings, the model converges and reaches its best recognition accuracy of 93% at the 45th training round.
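A hedged end-to-end training sketch under the stated settings (Adam, learning rate 0.001, cross-entropy supervision of the 8 group activity labels) is given below; it relies on the illustrative RelationModule, GlobalReasoning and TemporalAttentionFusion sketches defined earlier, which are assumptions rather than the patented implementation, and assumes appearance and flow features have been precomputed offline.

```python
import torch
import torch.nn as nn

relation = RelationModule()
reasoning = GlobalReasoning()
temporal = TemporalAttentionFusion()

params = list(relation.parameters()) + list(reasoning.parameters()) + list(temporal.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)
criterion = nn.CrossEntropyLoss()

def train_step(f_a_seq, geo_seq, f_opt_seq, label):
    """f_a_seq, f_opt_seq: (T, N, D); geo_seq: (T, N, N, 4); label: scalar class index."""
    frame_feats = []
    for f_a, geo, f_opt in zip(f_a_seq, geo_seq, f_opt_seq):
        f_r = relation(f_a, geo)                      # per-actor relation features
        frame_feats.append(reasoning(f_a, f_r, f_opt))  # frame-level global feature
    logits = temporal(torch.stack(frame_feats))       # sequence-level prediction
    loss = criterion(logits.unsqueeze(0), label.unsqueeze(0))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```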

Claims (3)

1. A volleyball group behavior recognition method based on multi-mode information fusion is characterized in that the following modules are designed and applied: the system comprises an apparent feature extraction module, a relation feature extraction module, a motion mode feature extraction module, a global reasoning module and a time domain fusion module;
the selected volleyball match video images and the individual marking frames thereof are used as input, and the apparent feature extraction module is used for extracting individual features of the selected volleyball match video images and outputting the apparent features of each individual of each image; the relationship feature extraction module takes the individual apparent features and the individual rectangular boxes as input and outputs relationship features for expressing the interaction relationship between individuals; the motion mode feature extraction module takes the video image as input and outputs motion mode features for expressing the global motion state of the image; and then, carrying out feature fusion on the individual apparent features, the relationship features and the motion mode features in sequence through a global reasoning module and a time domain fusion module, analyzing by combining the fused features, and finally outputting a volleyball group behavior recognition result.
2. The method of claim 1, wherein the contents of each module are as follows:
1) an apparent feature extraction module
The first module is an apparent feature extraction module which extracts the apparent features of each target individual in the image as multi-modal information; the module extracts the apparent characteristics of each target individual, namely the player, by using a trained deep convolutional neural network model according to the position marking information of each individual in the target image; the individual apparent features are features which are abstractly extracted from image RGB information distribution based on a convolutional neural network and are used for expressing image semantic information;
firstly, extracting a full map feature from volleyball video images by using a trained deep convolution neural network model, and then processing the corresponding relation between a candidate frame (bounding box) of each participating target (operator) and the full map feature by applying a RoI-Align mechanism in a Mask-RCNN algorithm model so as to complete feature extraction of each target individual; then, carrying out vector alignment on the features by using a full connection layer, and obtaining a D-dimensional apparent feature vector of each target individual through the full connection layer;
the number of targets in a certain frame of the video is N, and a matrix with N × D dimension is used for representing the feature vectors of all targets, wherein N is the number of the targets, and D is the size of the relational feature;
2) relational feature extraction module
The second module is a relational feature extraction module which extracts the relational features of each target individual in the image as information of a new mode; firstly, extracting geometric information characteristics from geometric coordinates of each target rectangular frame in an image by using a bounding box target regression (bounding box regression target) formula, and then carrying out relational modeling and characteristic expression on geometric information and apparent information by using a relational modeling method in a relationship Network algorithm model on the extracted geometric position information; extracting the characteristics of the relation between the targets through a series of nonlinear transformation and an attention mechanism based on the size relation and the geometric position relation between the targets;
firstly, embedding geometric features between any two targets in an image into a K-dimensional high-dimensional space for expression based on a bounding box target regression formula, wherein the geometric position labels of target individuals are provided by a public data set 'Volleyball'; then combining the geometric information of the high-dimensional expression with the apparent characteristic information, and executing a series of nonlinear transformation through the operation of weight training; outputting the relation expression between every two targets into a feature vector of a D dimension;
3) motion pattern feature extraction module
The third module is a motion mode feature extraction module which extracts the motion mode features of the image as information of a new mode; sending the optical flow quantization graph of the target image into a trained residual error network classification model, wherein the obtained characteristics are characteristic vectors expressing the motion mode of the whole image scene;
firstly, extracting an optical flow graph from a selected adjacent video image by using an optical flow extraction network PWC-NET to obtain an optical flow image for expressing image motion; then, carrying out quantization processing on the light flow graph, and mapping the numerical value used for expressing the pixel motion degree to a color space in a range of 0-255 to obtain a quantized light flow graph; finally, the quantized optical flow graph is sent into a trained depth classification model, and the motion mode expression characteristics of the image scene are obtained;
4) global reasoning module
The fourth module is a global reasoning module which has the function of integrating the multi-modal characteristic information extracted by the modules; sending the multi-modal information into a trained recurrent neural network sequence model GRU, realizing effective coding and global reasoning of the information, and fusing individual apparent characteristics, relationship characteristics and image motion mode characteristics;
a group of feature fusion modules of Optical flow-GRU (Opt-GRU for short) and R is providedThe evolution-GRU is used for coding different characteristics to transmit messages, so that the function of semantic information global reasoning is realized; first, the multi-modal information is summarized, and the apparent feature f isARelation characteristic fRAnd a motion pattern characteristic fOPerforming vertical splicing deformation to meet the input format of the GRU; then, the apparent feature f is usedAThe hidden unit input of the two GRU modules is used for relationship reasoning, and the multi-modal feature information respectively output by the relationship feature extraction module and the motion mode feature extraction module is input into the relationship-GRU and the Opt-GRU respectively, and feature vectors output by the two GRUs are fused by using average pooling operation; finally, maximum value pooling operation is needed to obtain the global information characteristics of the frame level (frame-level) which is aggregated and sorted; obtaining a global reasoning characteristic with dimension D for each frame of image in the video;
5) time domain fusion module
The fifth module is a time domain fusion module which fuses the features of each video frame from the time-domain perspective; the module integrates the information obtained by the global reasoning module from the time-domain perspective through an attention mechanism algorithm and outputs the final recognition result;
The selected volleyball video images are fed in sequentially, their apparent features, relation features and motion pattern features are extracted respectively, and the global reasoning features are obtained in the global reasoning module using the GRU model; for one volleyball group event, frame-level global features are obtained from every frame of the video, all the global features obtained under the same group event are input into an attention layer, the frame-level features are fused by dimension reduction into sequence-level features according to the self-attention parameter settings, the fused features are finally sent into a trained classification network layer (Softmax layer), and the behavior recognition result of the volleyball group is output.
3. The method according to claim 1, characterized by the following steps:
based on the annotations provided by the "Volleyball" dataset, the population events are classified into the following 8 categories: first left pass (l _ pass), second left pass (l _ set), left ball catch (l _ spike), left score (l _ winpoint), first right pass (r _ pass), second right pass (r _ set), right ball catch (r _ spike) and right score (r _ winpoint);
1) an apparent feature extraction module
In the Volleyball dataset, each video sequence consists of 21 match video frames with player position annotations, and a rectangular box annotation of each player target is provided in the dataset; only the key frame and the 5 frames before and 4 frames after it are used when training the network model, ten images in total, which serve as the source images for recognizing a volleyball group event;
in the process of training the deep network for extracting the apparent features, the resnet-50 is selected as a backbone network, so that the feature extraction effectiveness is ensured, and the calculation cost is reduced; after extracting multi-scale features from the target image, the backbone network integrates the position coordinate information of different target individuals by using a roi-align processing algorithm, so that the model obtains the apparent features of each player respectively; finally, integrating the characteristics of each target individual by using a maximum pooling method, and classifying the integrated characteristics by using a softmax layer; in the training process, an emb _ features parameter of the backbone network is set to 2048, and the apparent feature size is set to 1024;
the data for training are divided according to training, verifying and testing sets given by the volleyball official, 200 training rounds are performed in total, and the learning rate is set to be 0.00001;
in the process of extracting the features, a filling method is designed corresponding to the phenomenon that the number of the operators in individual image frames is inconsistent, and the filling method is used for extracting apparent features with the same dimension; in the images with the number of the targets less than N, N is 12 in the Volleyball data set, and candidate frames with the largest long edges in the existing targets are sequentially copied and filled; then, using the trained model to extract the characteristics of the model, and storing the model offline; the extraction of the apparent features of 12 x 1024 dimensions in each picture is realized;
2) relational feature extraction module
In the Volleyball dataset, each frame image contains the position coordinate information of each player target (actor), so the apparent feature $f_A$ of each player is obtained through the apparent feature extraction module; in this module, the coordinate information is converted into a high-order spatial expression through the bounding box regression target formula, defining the geometric feature $f_G$; the originally labeled 4-dimensional rectangular box (bounding box) information of each target individual is embedded into a 64-dimensional high-dimensional space through the following formula (1) to represent the geometric information between target boxes; assuming N targets, the geometric relationship between the i-th and j-th targets is expressed as:

$$f_G(i,j)=\left(\log\frac{|x_i-x_j|}{w_i},\ \log\frac{|y_i-y_j|}{h_i},\ \log\frac{w_j}{w_i},\ \log\frac{h_j}{h_i}\right) \quad (1)$$

$f_G$ denotes the geometric feature, and x, y, w and h denote the horizontal and vertical coordinates of the upper-left corner of the rectangular box and its width and height, respectively; the subscripts i and j denote the indices of the targets;

for each frame in the volleyball event video, the apparent features $f_A$ and geometric features $f_G$ of the N target individuals are obtained; the relation feature $f_R(i)$ of each target individual is computed as follows:

$$f_R(i)=\sum_{j} w_{ij}\cdot\left(W_V f_A^j\right) \quad (2)$$

the relation feature $f_R$ in formula (2) is a weighted sum of the apparent features of the target individuals, where $W_V f_A^j$ is the apparent feature of the j-th target linearly transformed by the weight $W_V$, the weight being obtained by joint training with the subsequent modules; the relation weight $w_{ij}$ represents the influence between targets i and j and is expressed as follows:

$$w_{ij}=\frac{w^G_{ij}\cdot\exp\!\left(w^A_{ij}\right)}{\sum_{k} w^G_{ik}\cdot\exp\!\left(w^A_{ik}\right)} \quad (3)$$

the appearance weight $w^A_{ij}$ in formula (3) is calculated by formula (4) and the geometric weight $w^G_{ij}$ by formula (5), and $w^A_{ik}$ and $w^G_{ik}$ are calculated in the same way as formulas (4) and (5); the subscripts i, j and k denote the i-th, j-th and k-th targets, and the denominator of formula (3) normalizes the weights of the j-th target over the k targets;

$$w^A_{ij}=\frac{\sum\left(\left(W_k f_A^i\right)\odot\left(W_q f_A^j\right)\right)}{\sqrt{d_k}} \quad (4)$$

$$w^G_{ij}=\max\!\left\{0,\ W_G\cdot\varepsilon_G\!\left(f_g^i,f_g^j\right)\right\} \quad (5)$$

$W_k$ and $W_q$ in formula (4) are the weight matrices that map the apparent features $f_A^i$ and $f_A^j$ into a subspace, the weights being obtained by joint training with the subsequent modules; ⊙ in the formula denotes element-wise multiplication, i.e., multiplication of the corresponding entries of the vectors, and $d_k$ denotes the projected feature size; in formula (5), the function $\varepsilon_G$ denotes the calculation procedure of formula (1), $f_g$ denotes the four-dimensional coordinates of a rectangular box, and $W_G$ denotes a learned weight obtained by joint training with the subsequent modules;

in summary, the geometric features between every two target individuals are embedded into a 64-dimensional space for high-dimensional expression, giving geometric features $f_G$ of dimension N × K (N is the number of actors, K is the geometric feature size); the embedded features are converted into scalar weights through $W_G$ and a non-linear operation is then performed; the non-linear operation restricts the relations to targets having a certain geometric relationship; finally, the relational expression of each actor is shaped into a relation feature $f_R$ of dimension D; N is set to 12, K to 64, $d_k$ to 64 and D to 1024, and a 12 × 1024 relation feature expression is obtained;

the geometric features $f_G$ are extracted in advance from the values of the target boxes and stored in an offline file to facilitate subsequent computation, and the parameters $W_V$, $W_G$, $W_k$ and $W_q$ that need to be trained for extracting the relation feature $f_R$ are obtained by joint training with the global reasoning module and the time domain fusion module, without separate parameter training;
3) motion pattern feature extraction module
In this module, the optical flow extraction network PWC-Net, pre-trained on the UCF101 dataset, is first used to extract the corresponding optical flow maps from the volleyball video, and the output results are stored; each output optical flow image is calculated from two adjacent frames, 10 frames of images around the key frame are used for recognition, and accordingly one additional frame is appended after the 10th frame so that the same number of optical flow maps is obtained for subsequent calculation; based on observation and statistics of the output optical flow maps, the motion information values are filtered to the specified range [-20, 20], and motion information outside this range is quantized to -20 or 20 respectively, thereby achieving the purpose of filtering out noise information; the values in the range [-20, 20] are then scaled proportionally and mapped into the color expression space [0, 255], the calculation process being shown in the following formula, where $V_o$ is the motion information corresponding to the optical flow map, $O_{min}$ is the minimum optical flow value -20, $O_{max}$ is the maximum optical flow value 20, and N takes the value 256;

$$V_{quant}=\frac{V_o-O_{min}}{O_{max}-O_{min}}\times\left(N-1\right)$$
then, the quantized optical flow graph is sent to a convolutional neural network resnet50, and a model is trained by matching with a softmax classification network and taking behavior recognition as a classification result; different from the traditional three-channel RGB image, the quantized optical flow image is two channels, so that for the first layer of convolution layer, the channel parameter of a convolution kernel needs to be modified from 3 to 2 so as to be suitable for the input of the optical flow graph; then, performing classification training by using an adam optimizer in a matching manner; then, for each target individual, performing local extraction on the global motion mode features one by one to obtain a classification model and storing output features; finally, extracting the characteristics of the motion mode with the size of 1024 dimensions to obtain characteristic vectors with the dimensions of 12 x 1024; the feature is used for global inference of motion relation between targets in a subsequent module;
4) global reasoning module
The module performs feature fusion on the obtained actor-level features to obtain frame-level features; for each target node, the key to the interaction is to encode the information transfer from the motion expression and other nodes; a GRU is used as the core component of the module;
the GRU cell has two important components, a reset gate (reset) and an update gate (update), which are formulated as follows:
$$r=\sigma\!\left(U_r\cdot \mathrm{concat}\!\left(x,h_t\right)\right) \quad (6)$$

$$z=\sigma\!\left(U_z\cdot \mathrm{concat}\!\left(x,h_t\right)\right) \quad (7)$$

where σ is the sigmoid activation function, concat denotes the concatenation operation of two vectors, and $U_r$ and $U_z$ are learnable weight matrices, the weights being obtained by joint training with the subsequent modules; $h_t$ is the previous hidden-layer state, and the input x has the same dimension as $h_t$; the activation unit $h_{t+1}$ used is expressed as follows:

$$\tilde{h}=\tanh\!\left(U_x\cdot x+V\cdot\left(r\odot h_t\right)\right) \quad (8)$$

$$h_{t+1}=\left(1-z\right)\odot h_t+z\odot\tilde{h} \quad (9)$$

where tanh is the activation function, and $U_x$ and V respectively denote the weight matrices connecting the input and the previous hidden state to the candidate state, the weights being obtained by joint training with the subsequent modules; ⊙ denotes element-wise multiplication, i.e., multiplication of the corresponding entries of the vectors; in this expression, the memory unit (cell) allows the hidden state, through the reset gate, to remove any information found to be irrelevant to the input; on the other hand, the memory unit controls the quantity of information transmitted from the previous state to the current hidden state, thereby allowing a more effective expression through the update gate;

Optical flow-GRU (Opt-GRU for short) and Relation-GRU are proposed for encoding the above two features to transmit messages; Opt-GRU takes the apparent feature $f_A$ of each target individual, treated as a node, as the initial hidden state and takes the motion pattern feature of the target individual as input; Relation-GRU likewise uses the apparent feature $f_A$ as the initial hidden state and takes the relation modal feature of the target individual as input;

the integrated expression $h_{t+1}$ of the features is obtained; in this part, fusion is performed using average pooling:

$$h_{t+1}=\frac{1}{2}\left(h^{opt}_{t+1}+h^{rel}_{t+1}\right) \quad (10)$$

where $h^{opt}_{t+1}$ is the output of the Opt-GRU, $h^{rel}_{t+1}$ denotes the output of the Relation-GRU, and $h_{t+1}$ is the integration vector fusing the outputs of the two GRUs; finally, a max pooling operation is applied to obtain the aggregated frame-level global information features; a global reasoning feature of dimension 1024 is obtained for each frame of image in the video;

as in the relation feature extraction module, the parameters $U_r$, $U_z$, $U_x$ and V that need to be trained for the integrated expression $h_{t+1}$ are obtained by joint training with the time domain fusion module, without separate parameter training;
5) time domain fusion module
First, a set of frame features is given as the node features, $h=\{h_1,h_2,\dots,h_n\}$, where n is the number of nodes; in order to obtain sufficient expressive power to convert the input features into high-level features $h'$, a linear transformation with trainable parameters is needed; a shared linear transformation with weight matrix W is applied to each node:

$$a_i=\mathrm{softmax}\!\left(\tanh\!\left(W h_i\right)\right) \quad (11)$$

$$h'=\sum_{i} a_i\, h_i \quad (12)$$

where $a_i$ denotes the attention distribution coefficient and W denotes the learned weight, the weight being obtained through training; $h_i$ denotes a node feature, tanh is the activation function, and softmax denotes the normalized exponential function; $h'$ denotes the output high-level feature;
then applying a softmax classification network to carry out final classification; the classification of the whole model utilizes a standard cross entropy loss function (cross-entropy loss) to complete training, and finally the recognition task of volleyball group behaviors is realized;
the relational feature extraction module, the global reasoning module and the time domain fusion module are modeled and trained jointly, using the volleyball group behavior labels as supervision for learning the weight parameters.
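To make the final step concrete, a hedged sketch (not the patented implementation) of a softmax classifier trained with the standard cross-entropy loss on the fused high-level feature h'; the batch size, the number of group-behavior classes (8) and the Adam optimizer are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Assumed values for illustration: 1024-D fused feature h', 8 behavior classes, batch of 4.
num_classes = 8
classifier = nn.Linear(1024, num_classes)              # softmax classification network
criterion = nn.CrossEntropyLoss()                      # standard cross-entropy loss
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-4)

h_prime = torch.randn(4, 1024)                         # fused high-level features h'
labels = torch.randint(0, num_classes, (4,))           # group-behavior labels as supervision

logits = classifier(h_prime)                           # class scores (softmax is applied inside the loss)
loss = criterion(logits, labels)                       # cross-entropy against the labels

optimizer.zero_grad()
loss.backward()                                        # joint weight-parameter learning step
optimizer.step()
```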
CN202010154331.6A 2020-03-07 2020-03-07 Volleyball group behavior identification method based on multi-mode information fusion Active CN111401174B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010154331.6A CN111401174B (en) 2020-03-07 2020-03-07 Volleyball group behavior identification method based on multi-mode information fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010154331.6A CN111401174B (en) 2020-03-07 2020-03-07 Volleyball group behavior identification method based on multi-mode information fusion

Publications (2)

Publication Number Publication Date
CN111401174A true CN111401174A (en) 2020-07-10
CN111401174B CN111401174B (en) 2023-09-22

Family

ID=71430604

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010154331.6A Active CN111401174B (en) 2020-03-07 2020-03-07 Volleyball group behavior identification method based on multi-mode information fusion

Country Status (1)

Country Link
CN (1) CN111401174B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107194322A (en) * 2017-04-28 2017-09-22 南京邮电大学 A kind of behavior analysis method in video monitoring scene
CN108681712A (en) * 2018-05-17 2018-10-19 北京工业大学 A kind of Basketball Match Context event recognition methods of fusion domain knowledge and multistage depth characteristic
CN109241834A (en) * 2018-07-27 2019-01-18 中山大学 A kind of group behavior recognition methods of the insertion based on hidden variable
CN110348364A (en) * 2019-07-05 2019-10-18 北京工业大学 A kind of basketball video group behavior recognition methods that Unsupervised clustering is combined with time-space domain depth network

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
HENG FU et al.: "MF-SORT: Simple Online and Realtime Tracking with Motion Features" *
周培培; 丁庆海; 罗海波; 侯幸林: "Crowd abnormal behavior detection and localization in video surveillance" *
谭程午 et al.: "Group behavior recognition based on fused features" *

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112131944A (en) * 2020-08-20 2020-12-25 深圳大学 Video behavior identification method and system
CN112131943A (en) * 2020-08-20 2020-12-25 深圳大学 Video behavior identification method and system based on dual attention model
CN112131944B (en) * 2020-08-20 2023-10-17 深圳大学 Video behavior recognition method and system
CN112131943B (en) * 2020-08-20 2023-07-11 深圳大学 Dual-attention model-based video behavior recognition method and system
CN112528785A (en) * 2020-11-30 2021-03-19 联想(北京)有限公司 Information processing method and device
CN113065451B (en) * 2021-03-29 2022-08-09 四川翼飞视科技有限公司 Multi-mode fused action recognition device and method and storage medium
CN113065451A (en) * 2021-03-29 2021-07-02 四川翼飞视科技有限公司 Multi-mode fused action recognition device and method and storage medium
CN113297936A (en) * 2021-05-17 2021-08-24 北京工业大学 Volleyball group behavior identification method based on local graph convolution network
CN113297936B (en) * 2021-05-17 2024-05-28 北京工业大学 Volleyball group behavior identification method based on local graph convolution network
CN113239828A (en) * 2021-05-20 2021-08-10 清华大学深圳国际研究生院 Face recognition method and device based on TOF camera module
CN113836992B (en) * 2021-06-15 2023-07-25 腾讯科技(深圳)有限公司 Label identification method, label identification model training method, device and equipment
CN113836992A (en) * 2021-06-15 2021-12-24 腾讯科技(深圳)有限公司 Method for identifying label, method, device and equipment for training label identification model
US20230161000A1 (en) * 2021-11-24 2023-05-25 Smart Radar System, Inc. 4-Dimensional Radar Signal Processing Apparatus
CN114187546A (en) * 2021-12-01 2022-03-15 山东大学 Combined action recognition method and system
CN114863356B (en) * 2022-03-10 2023-02-03 西南交通大学 Group activity identification method and system based on residual aggregation graph network
CN114863356A (en) * 2022-03-10 2022-08-05 西南交通大学 Group activity identification method and system based on residual aggregation graph network
CN117576784A (en) * 2024-01-15 2024-02-20 吉林大学 Method and system for recognizing diver gesture by fusing event and RGB data
CN117576784B (en) * 2024-01-15 2024-03-26 吉林大学 Method and system for recognizing diver gesture by fusing event and RGB data

Also Published As

Publication number Publication date
CN111401174B (en) 2023-09-22

Similar Documents

Publication Publication Date Title
CN111401174B (en) Volleyball group behavior identification method based on multi-mode information fusion
Dai et al. Human action recognition using two-stream attention based LSTM networks
CN110472531B (en) Video processing method, device, electronic equipment and storage medium
CN107506740B (en) Human body behavior identification method based on three-dimensional convolutional neural network and transfer learning model
Srinivas et al. A taxonomy of deep convolutional neural nets for computer vision
Kae et al. Augmenting CRFs with Boltzmann machine shape priors for image labeling
CN110717431A (en) Fine-grained visual question and answer method combined with multi-view attention mechanism
CN112446476A (en) Neural network model compression method, device, storage medium and chip
CN109670576B (en) Multi-scale visual attention image description method
CN110348364B (en) Basketball video group behavior identification method combining unsupervised clustering and time-space domain depth network
CN111666919B (en) Object identification method and device, computer equipment and storage medium
Ren et al. Learning with weak supervision from physics and data-driven constraints
CN112668366B (en) Image recognition method, device, computer readable storage medium and chip
US20220383639A1 (en) System and Method for Group Activity Recognition in Images and Videos with Self-Attention Mechanisms
CN112036276A (en) Artificial intelligent video question-answering method
CN112712068B (en) Key point detection method and device, electronic equipment and storage medium
Huang et al. Vqabq: Visual question answering by basic questions
CN115223020B (en) Image processing method, apparatus, device, storage medium, and computer program product
CN113221663A (en) Real-time sign language intelligent identification method, device and system
Li et al. Modelling human body pose for action recognition using deep neural networks
CN113312973A (en) Method and system for extracting features of gesture recognition key points
CN112906520A (en) Gesture coding-based action recognition method and device
CN103500456A (en) Object tracking method and equipment based on dynamic Bayes model network
CN114821188A (en) Image processing method, training method of scene graph generation model and electronic equipment
Zhao et al. Research on human behavior recognition in video based on 3DCCA

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant