CN113673489B - Video group behavior identification method based on cascade Transformer - Google Patents


Info

Publication number
CN113673489B
CN113673489B
Authority
CN
China
Prior art keywords
layer
target
transformer
human body
layers
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111225547.8A
Other languages
Chinese (zh)
Other versions
CN113673489A (en)
Inventor
李玲
徐晓刚
王军
祝敏航
曹卫强
朱亚光
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lab filed Critical Zhejiang Lab
Priority to CN202111225547.8A priority Critical patent/CN113673489B/en
Publication of CN113673489A publication Critical patent/CN113673489A/en
Application granted granted Critical
Publication of CN113673489B publication Critical patent/CN113673489B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the fields of computer vision and deep learning, and in particular to a video group behavior identification method based on a cascade Transformer. The method first collects and generates a video data set, extracts three-dimensional spatio-temporal features of the video through a three-dimensional backbone network, and selects the key frame image spatial feature map; the key frame image spatial feature map is preprocessed and sent into a human body target detection Transformer, which outputs the human body target boxes in the key frame image; the sub-feature maps corresponding to the screened human body target boxes are then mapped on the key frame image feature map, the query/key/value are computed by combining the feature maps of the frames surrounding the key frame and input into a group behavior recognition Transformer, which outputs a group-level spatio-temporal coding feature map; finally, the group behaviors are classified by a multilayer perceptron. The method effectively improves the group behavior recognition accuracy.

Description

Video group behavior identification method based on cascade Transformer
Technical Field
The invention relates to the fields of computer vision and deep learning, and in particular to a video group behavior identification method based on a cascade Transformer.
Background
Nowadays, surveillance video is widely deployed in public places and plays an extremely important role in maintaining public safety. Effectively identifying abnormal behaviors and events in surveillance video allows it to fulfil this role better. Group behaviors are among the most common human activities in video; automatically recognizing group behaviors in video can effectively prevent dangerous events, so group behavior recognition has wide application value.
In natural scenes, video group behavior identification mainly faces two challenges. First, scenes are complex: large variations in human scale, background illumination and mutual occlusion within the group make it difficult to extract individual behavior features. Second, the hierarchical relationship between individuals and the group is difficult to model: some individuals have a large influence on the group behavior while others contribute relatively little, and these differences increase the complexity of the contextual relations among individuals. Highlighting the different contributions of different individuals to the group behavior is the key to effective group behavior recognition.
Most recent group behavior recognition methods are based on deep learning and fall into two categories. The first uses a single-stage three-dimensional convolutional neural network to extract spatio-temporal features, which are sent to a fully connected layer for group behavior recognition. The second is a two-stage method: in the first stage, individual features are extracted, usually by detecting human body target boxes with a target detection algorithm and then extracting individual spatio-temporal features of the target boxes with a three-dimensional convolutional network, or by extracting individual skeleton features of the target boxes with a skeleton-based method; in the second stage, the hierarchical relationship between individuals and the group is modeled, the relations among the individual features extracted in the first stage are obtained, and the group-level features are output and sent to a fully connected layer for group behavior recognition, mainly using methods based on recurrent convolutional networks, graph networks or weighted fusion.
Patent CN110991375A discloses constructing a target loss function and building a single-stage deep neural network with a multi-channel encoder and a decoder for group behavior recognition. Its disadvantage is that a single-stage network model cannot extract individual and group features well at the same time, resulting in low recognition accuracy.
Patent CN111178323A discloses using the target detection algorithm SSD to extract the human body boxes in each video frame, using the OpenPose algorithm to extract the skeleton features of each individual, and then fusing the individual skeleton features with a hand-crafted method to extract the group representation features. Its disadvantages are that the target detection and skeleton extraction algorithms cannot be trained end to end and must be fine-tuned offline for the actual deployment scene before being fed into the group feature extraction network, which increases the difficulty of practical application; moreover, the group feature extraction relies on manual design and cannot effectively and automatically extract group-level spatio-temporal features, and research shows that hand-crafted features are easily affected by scene and illumination and have poor robustness.
Patent CN110796081A discloses first detecting human body targets with a target detection network and extracting single-frame human body target features with a convolutional network, then constructing a graph model from the appearance and position relations among individuals and extracting single-frame group behavior representation features with a graph convolutional neural network, and finally fusing multi-frame group behavior features to obtain the video group behavior representation features. Its disadvantages are that the graph convolutional network does not highlight the discriminative individual features in the group when extracting the single-frame group spatial features, and that simple weighted fusion along the video time dimension cannot extract the video temporal features well.
Disclosure of Invention
To solve the above technical problems in the prior art, the invention provides a video group behavior recognition method based on cascaded Transformers, realized by a two-stage Transformer network: the first-stage human body target detection Transformer detects the human body target boxes and extracts the discriminative individual features in the group; the second-stage group behavior recognition Transformer extracts single-frame spatial features and inter-frame temporal features through a self-attention mechanism and effectively fuses the individual behavior features to extract group-level features; finally, the group behavior category is output through a multilayer perceptron, and the whole network can be trained end to end. The specific technical scheme is as follows:
a video group behavior identification method based on a cascade Transformer comprises the following steps:
step one: constructing a data set, namely the original video sequences, using the open-source fighting data set RWF-2000 together with surveillance video data collected from the network and collected and labeled locally;
step two: inputting the original video sequences obtained in step one into a backbone network, acquiring temporal and spatial feature maps at the Conv5 convolutional layer of the backbone network, selecting the video key frame image feature map, and preprocessing the key frame image feature map;
step three: transforming the scale of the key frame image feature map preprocessed in step two, inputting it into the human body target detection Transformer encoder, extracting image coding features through a self-attention mechanism, sending the image coding features and learnable query vectors into the human body target detection Transformer decoder to output target query vectors, and finally constructing a classification head and a regression head from a fully connected layer and multilayer perceptron layers, which respectively predict and output the target classification confidence and the target box position coordinates on the image;
step four: taking the key frame image feature map preprocessed in step two and the target category confidence and target box image coordinates output in step three as input, screening the human body target boxes by the target category confidence, mapping the sub-feature maps corresponding to the screened human body target boxes on the feature map, preprocessing the sub-feature maps to obtain the query, and linearly mapping the frame images around the key frame to obtain the key and the value;
step five: sending the query, key and value into the group behavior recognition Transformer encoder module, outputting the group-level spatio-temporal coding feature map, and outputting the group behavior recognition prediction and its confidence through the multilayer perceptron layer;
step six: constructing the loss function and training the network model.
Further, step one is specifically: using the open-source fighting data set RWF-2000 together with surveillance video data collected from the network and collected and labeled locally, the collected videos are cut into 5-second segments at a frame rate of 30 frames per second, video segments containing noise or blurred pictures are deleted, the top-left coordinates, width and height of the target box of the region where each human body is located in every video frame and the group behavior category are labeled, and data sets of the corresponding categories are constructed, where the behavior categories comprise fighting, gathering and running.
Further, the second step specifically includes the following steps:
(2.1) a 3D ResNet50 deep residual network is selected as the backbone network, and the Conv2, Conv3, Conv4 and Conv5 layers of the 3D ResNet50 are downsampled only in the spatial dimension and not in the temporal dimension, giving the Conv5 layer feature map F ∈ R^{C×T×H'×W'}, where T denotes the number of frames;
(2.2) the middle frame of the temporal sequence is selected as the key frame image, giving the key frame image feature map F_key ∈ R^{C×H'×W'}; the channel dimension of the feature map is reduced with a 1×1 convolution, and the new feature map is denoted F' ∈ R^{d×H'×W'}; a position encoding matrix P ∈ R^{d×H'×W'} is introduced, using two-dimensional sinusoidal position encoding, and the new feature map matrix is F'' = F' + P.
Further, step three is specifically: the width-height matrix of the feature map F'' output in step (2.2) is flattened into a one-dimensional vector to form a new feature map, which is input into the human body target detection Transformer encoder; after the 6 encoder layers of the human body target detection Transformer encoder, the image-context-related coding feature map E is output; then a set of fixed learnable embedded target query vectors is set and input together with the feature map E into the human body target detection Transformer decoder; after the 6 decoder layers, the decoder reasons about the relations between objects from the image-feature context and outputs the target query vectors, i.e. the number of target predictions, in parallel, which are sent to a classification head and a target box regression head, where the classification head consists of one fully connected layer and outputs the confidences of the two categories human body and background, and the target box regression head consists of one feedforward neural network layer and outputs the position coordinates of the target box on the image.
Furthermore, the human body target detection Transformer encoder and decoder both adopt the encoder and decoder structure of the DEtection TRansformer (DETR); the encoder comprises M encoder layers, each consisting of 1 multi-head self-attention layer, 2 layer normalization layers and 1 feedforward neural network layer; the decoder comprises M decoder layers, each consisting of 2 multi-head self-attention layers, 3 layer normalization layers and 1 feedforward neural network layer.
Further, the fourth step specifically includes the following steps:
(4.1) the human body target boxes output in step three are arranged in descending order of confidence, the first k boxes are selected and mapped with the RoIAlign algorithm onto the feature map F' output in step (2.2), giving the corresponding sub-feature maps;
(4.2) the width-height matrices of the sub-feature maps are flattened into one-dimensional vectors to form a new feature map, a learnable position encoding matrix is added, and after layer normalization a learnable projection is applied to obtain the query Q;
(4.3) the Conv5 layer feature map F output in step (2.1) has its channel dimension reduced by a 1×1 convolution, its feature width-height matrix is flattened into a one-dimensional vector to form a new feature map, and the subsequent processing is consistent with the query, giving the key K and the value V.
Further, step five is specifically: Q, K and V are sent into the group behavior recognition Transformer encoder module; the encoder module has 3 layers, each layer has two heads side by side, and each head is a group behavior recognition Transformer basic module; Q, K and V are sent to the two heads of the first layer, which output two coding matrices in parallel; the two output coding matrices are concatenated to obtain the updated query of the first layer, which serves as the input of the next layer; after the 3 Transformer coding layers, the group-level spatio-temporal coding feature map is output and finally sent to the multilayer perceptron layer to obtain the group behavior recognition prediction and its confidence.
Further, the fifth step includes the following steps:
(5.1) using the Q and K output in step four, the self-attention weight matrix A^{(l,a)} of the a-th head in the l-th layer is computed by dot-product operation;
(5.2) the attention weight matrix obtained in step (5.1) and the value matrix obtained in step four are weighted and summed, the summation result is passed through a dropout layer and added to the original Q^{(l,a)} to obtain Q'^{(l,a)}; after layer normalization, Q'^{(l,a)} passes through two feedforward neural network layers and a dropout layer with a residual connection, and finally the updated matrix Q''^{(l,a)} is obtained through a normalization layer;
(5.3) the outputs Q''^{(l,a)} of all heads of the l-th layer obtained in step (5.2) are concatenated to obtain the new Q^{(l+1)}, which serves as the input of the (l+1)-th layer; the query update is computed iteratively according to step (5.2) until the final group-level spatio-temporal coding feature map is obtained after the three Transformer coding layers;
(5.4) the coding feature map output in step (5.3) is sent to the multilayer perceptron layer to obtain the group behavior recognition prediction and its confidence.
Further, the loss function includes a binary classification loss L_cls, a regression loss L_box and a multi-classification loss L_group. Each loss is weighted and summed, and the weight of each loss function is adjusted with the hyper-parameters α, β and γ to obtain the total loss L:
L = α·L_cls + β·1{c_i≠∅}·L_box + γ·L_group
where α, β and γ are the weights and 1{c_i≠∅} is an indicator function that equals 1 when c_i ≠ ∅ and 0 otherwise; the binary classification loss L_cls is computed from the prediction output by the classification head and the matched target box ground truth, the regression loss L_box is computed from the target box position prediction of the regression head and the matched target box ground truth, and the multi-classification loss L_group is computed from the prediction output by the multilayer perceptron layer, i.e. the multi-classification head, and the ground-truth label.
Further, training the network model includes initializing the human body target detection Transformer; the human body target detection Transformer adopts the DEtection TRansformer (DETR), which uses a 2D ResNet50, and the 2D ResNet50 parameter values are repeated T times in the time dimension for initialization, so that the weights learned for a single RGB image are extended to the T-frame input; the network is back-propagated based on the loss function, the network parameters are continuously updated by batch gradient descent, and the model converges after batch training.
Compared with the prior art, the invention has the beneficial effects that:
(1) A video group behavior recognition model based on cascaded Transformers is designed; combining the human body target detection Transformer and the group behavior recognition Transformer enables end-to-end training, avoids manual feature extraction and offline training, and reduces the complexity of the algorithm.
(2) The improved three-dimensional convolutional neural network effectively extracts the spatio-temporal feature map; combined with the high-confidence human body target boxes regressed by the first-stage human body target detection Transformer, the human body target box sub-feature maps are mapped on the feature map, so that the second-stage network focuses on human body behavior features, background noise interference is avoided, and the algorithm is robust to complex scenes.
(3) The group behavior recognition Transformer distinguishes the contribution of each individual in the group through a multi-layer, multi-head self-attention mechanism and self-attention weight calculation, fuses the complex spatial and temporal context relations among individuals, and effectively improves the group behavior recognition accuracy. On the RWF-2000 fighting validation data set with re-annotated human body target boxes, the method achieves 92.3% accuracy.
(4) The method can effectively identify group behaviors in video and prevent dangerous events, has wide application value, is suitable for video surveillance in indoor and outdoor complex scenes, and is particularly suitable for recognizing the group behaviors of fighting, running and gathering.
Drawings
FIG. 1 is a data set generation flow diagram of the present invention;
FIG. 2 is a flow chart of a video group behavior recognition method based on a cascade Transformer according to the present invention;
FIG. 3 is a diagram of a human target detection Transformer network architecture according to the present invention;
FIG. 4 is a diagram of a population behavior recognition Transformer network architecture according to the present invention;
FIG. 5 is a schematic diagram of the basic module of the group behavior recognition Transformer encoder layer according to the present invention.
Detailed Description
In order to make the objects, technical solutions and technical effects of the present invention more clearly apparent, the present invention is further described in detail below with reference to the accompanying drawings and examples.
As shown in fig. 2, a video group behavior recognition method based on a cascade Transformer first collects and generates a video data set, extracts three-dimensional spatio-temporal features of the video through a three-dimensional backbone network, and selects the key frame image spatial feature map; the key frame image spatial feature map is preprocessed and sent into the human body target detection Transformer, which outputs the human body target boxes in the key frame image; the sub-feature maps corresponding to the screened human body target boxes are then mapped on the key frame image feature map, the query/key/value are computed by combining the feature maps of the frames surrounding the key frame and input into the group behavior recognition Transformer, which outputs the group-level spatio-temporal coding feature map; finally, the group behaviors are classified by a multilayer perceptron. The method specifically comprises the following steps:
Step one: collect and generate the video data set. A data set is constructed from the open-source fighting data set RWF-2000 and from surveillance video data collected from the network and collected and labeled locally, and is divided into a training set and a test set at a ratio of 4:1. Specifically, as shown in fig. 1, videos are first collected from the network by entering behavior category keywords into websites and downloading the related videos, repeating the search with keywords in different languages for data diversity; next, the videos collected from the network and by local cameras are screened, and duplicate videos and videos unrelated to the target behaviors are deleted; the videos are cut into 5-second segments at a frame rate of 30 frames per second, and segments containing noise or blurred pictures are deleted; finally, the top-left coordinates, width and height of the target box of the region where each human body is located in every video frame and the group behavior category are labeled, and data sets of the corresponding categories are constructed, where the behavior categories are divided into fighting, gathering and running.
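As a minimal sketch of the clipping step above (not part of the patent; the function name and output naming scheme are illustrative, and the source video is assumed to already be at 30 frames per second), a 5-second segmentation with OpenCV could look as follows; noisy or blurred clips are still screened out afterwards as described.

```python
# Hypothetical helper for the data-set construction step: cut a video into
# 5-second clips at 30 fps. Assumes the source video is already 30 fps.
import cv2

def clip_video(src_path: str, dst_prefix: str, clip_seconds: int = 5, fps: int = 30) -> int:
    cap = cv2.VideoCapture(src_path)
    width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    fourcc = cv2.VideoWriter_fourcc(*"mp4v")
    frames_per_clip = clip_seconds * fps
    clip_index, frame_index, writer = 0, 0, None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_index % frames_per_clip == 0:        # start a new 5-second segment
            if writer is not None:
                writer.release()
            writer = cv2.VideoWriter(f"{dst_prefix}_{clip_index:04d}.mp4",
                                     fourcc, fps, (width, height))
            clip_index += 1
        writer.write(frame)
        frame_index += 1
    if writer is not None:
        writer.release()
    cap.release()
    return clip_index                                  # number of clips written
```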
Steps two and three are shown in fig. 3: temporal and spatial feature maps are extracted from the original video sequence by the three-dimensional convolutional neural network, the spatial feature map of the key frame image is selected, position encoding information is added to the spatial feature map to form the new embedding vector, the embedding vector is reshaped and input into the human body target detection Transformer encoder, image coding features are extracted through a self-attention mechanism, the image coding features and learnable query vectors are sent into the human body target detection Transformer decoder, the target query vectors are output, and finally a classification head and a regression head are constructed from a fully connected layer and a multilayer perceptron layer, which respectively predict the target box classification confidence and the target box image coordinates.
The human body target detection Transformer encoder and decoder both adopt the encoder and decoder structure of the DEtection TRansformer (DETR); the encoder comprises M encoder layers, each consisting of 1 multi-head self-attention layer, 2 layer normalization layers and 1 feedforward neural network layer; the decoder comprises M decoder layers, each consisting of 2 multi-head self-attention layers, 3 layer normalization layers and 1 feedforward neural network layer. M = 6 in this embodiment.
Step two is: input the original video sequence into the backbone network, acquire the temporal and spatial feature maps at the Conv5 layer of the backbone network, select the video key frame image feature map and preprocess it; specifically:
(2.1) The backbone network acquires the spatio-temporal feature map of the image sequence. The original video sequence input is X ∈ R^{3×T×H×W}, representing T frames of RGB images of height H and width W. The 3D ResNet50 deep residual network is selected as the backbone network; to represent more detailed inter-frame motion information, the Conv2, Conv3, Conv4 and Conv5 layers of the 3D ResNet50 are not downsampled in the temporal dimension and are downsampled only in the spatial dimension, so the Conv5 layer feature map F ∈ R^{C×T×H'×W'} is obtained, where H'×W' is the downsampled spatial size and C = 2048 denotes the channel dimension.
(2.2) Key frame image feature map and its preprocessing. The middle frame of the temporal sequence is selected as the key frame image, and the key frame image feature map F_key ∈ R^{C×H'×W'} is obtained; the channel dimension of the feature map is reduced with a 1×1 convolution to reduce the computational complexity, and the new feature map is denoted F' ∈ R^{d×H'×W'}. Since the Transformer cannot represent positional relations, a position encoding matrix P ∈ R^{d×H'×W'} is introduced, using two-dimensional sinusoidal position encoding. The new feature map matrix is set to F'' = F' + P, where d = 256 denotes the channel dimension after dimensionality reduction.
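The following is a minimal PyTorch sketch of this preprocessing; it is illustrative only, and the module name, the exact sinusoidal channel layout and the choice of the middle frame index are assumptions consistent with the description above (C = 2048, d = 256).

```python
# Illustrative sketch: Conv5 feature map (B, C, T, H', W') -> key-frame embedding F'' (B, d, H', W').
import math
import torch
import torch.nn as nn

def sine_position_encoding(d: int, h: int, w: int) -> torch.Tensor:
    """Two-dimensional sinusoidal encoding (d, h, w): first d//2 channels encode rows, rest columns."""
    half = d // 2
    div = torch.exp(torch.arange(0, half, 2, dtype=torch.float32) * (-math.log(10000.0) / half))
    y = torch.arange(h, dtype=torch.float32).unsqueeze(1) * div      # (h, half//2)
    x = torch.arange(w, dtype=torch.float32).unsqueeze(1) * div      # (w, half//2)
    pe = torch.zeros(d, h, w)
    pe[0:half:2] = torch.sin(y).t().unsqueeze(2).expand(-1, -1, w)   # row sine channels
    pe[1:half:2] = torch.cos(y).t().unsqueeze(2).expand(-1, -1, w)   # row cosine channels
    pe[half::2] = torch.sin(x).t().unsqueeze(1).expand(-1, h, -1)    # column sine channels
    pe[half + 1::2] = torch.cos(x).t().unsqueeze(1).expand(-1, h, -1)
    return pe

class KeyFramePreprocess(nn.Module):
    def __init__(self, in_channels: int = 2048, d: int = 256):
        super().__init__()
        self.reduce = nn.Conv2d(in_channels, d, kernel_size=1)       # 1x1 channel reduction
        self.d = d

    def forward(self, conv5: torch.Tensor) -> torch.Tensor:
        b, c, t, h, w = conv5.shape
        key = conv5[:, :, t // 2]                                    # middle frame as key frame
        feat = self.reduce(key)                                      # F' of shape (B, d, H', W')
        pos = sine_position_encoding(self.d, h, w).to(feat.device)   # P
        return feat + pos                                            # F'' = F' + P
```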
Step three is: in the encoding stage, the width-height matrix of the feature map output in step (2.2) is flattened into a one-dimensional vector, forming a feature of size d×H'W' that is input into the first layer of the DETR encoder; after the 6 encoder layers, the image-context-related coding feature map E is output. In the decoding stage, a fixed set of N learnable embedded object query vectors is preset, where N denotes the number of human body targets to be detected in the image; the feature map E output by the encoding stage is input into the first layer of the decoder; after the 6 decoder layers, the Transformer decoder reasons about the relations between objects from the image-feature context and outputs N target query vectors in parallel, which are sent to the classification head and the target box regression head. The classification head consists of one fully connected layer and outputs the confidences of the two categories human body and background; the target box regression head consists of one feedforward neural network layer and outputs the coordinate information b = (x_c, y_c, w, h) of the target box on the image, where (x_c, y_c) denotes the coordinates of the center point of the target box and (w, h) denotes the target box width and height.
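A minimal PyTorch sketch of this first-stage detector follows; it is illustrative rather than the patented implementation. The patent follows the DETR design, but the layer classes, head widths and the number of object queries used here are assumptions, and the per-layer positional encoding of DETR is simplified away (the encoding is added once to the input embedding).

```python
# Illustrative sketch of the human body target detection Transformer (DETR-style).
import torch
import torch.nn as nn

class HumanDetectionTransformer(nn.Module):
    def __init__(self, d: int = 256, heads: int = 8, num_queries: int = 50):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model=d, nhead=heads, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d_model=d, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=6)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=6)
        self.query_embed = nn.Embedding(num_queries, d)      # learnable object queries
        self.class_head = nn.Linear(d, 2)                    # human body vs. background
        self.box_head = nn.Sequential(                       # feedforward box regression head
            nn.Linear(d, d), nn.ReLU(), nn.Linear(d, 4))

    def forward(self, embedding: torch.Tensor):
        # embedding: (B, d, H', W') key-frame feature map F'' with position encoding added
        b, d, h, w = embedding.shape
        tokens = embedding.flatten(2).transpose(1, 2)         # (B, H'*W', d)
        memory = self.encoder(tokens)                         # image-context coding features E
        queries = self.query_embed.weight.unsqueeze(0).expand(b, -1, -1)
        decoded = self.decoder(queries, memory)               # (B, N, d) target query vectors
        logits = self.class_head(decoded)                     # class confidences
        boxes = self.box_head(decoded).sigmoid()              # normalized (x_c, y_c, w, h)
        return logits, boxes
```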
Steps four and five are shown in fig. 4: the key frame image feature map output in step (2.2), the category confidence output in step three and the coordinates of the target boxes on the image are used as input; the query, key and value are computed and sent into the group behavior recognition Transformer encoder module, which has 3 layers with 2 heads side by side in each layer and outputs the image-context-related coding feature map; finally, the feature map outputs the group behavior category and its confidence through the multilayer perceptron layer.
Step four is: the key frame feature map F' output in step (2.2) and the target category confidences and target box image coordinates output in step three are used as input; the target category confidence is used to screen the human body target boxes, the sub-feature maps corresponding to the screened human body target boxes are mapped on the feature map F', the sub-feature maps are preprocessed to obtain the query, and the key and the value are obtained by linearly mapping the frame images around the key frame; specifically:
and (4.1) mapping the human body target characteristic map.
Arranging the human body target frames output in the step three in a confidence degree descending order, selecting the first k human body target frames, mapping the k human body target frames through a RoiAlign algorithm, and outputting the characteristic diagram in the step (2.2)
Figure 217659DEST_PATH_IMAGE045
Sub-feature map corresponding to the above
Figure 468511DEST_PATH_IMAGE046
Wherein
Figure 247111DEST_PATH_IMAGE047
The feature map is represented by height and width, and the channel dimensions d =256 and k = 30.
(4.2) Query, key and value calculation. The width-height matrix of each sub-feature map is flattened into a one-dimensional vector of size d×hw, a learnable position encoding matrix is added to the feature map, and after layer normalization a learnable projection matrix is applied to obtain the query Q; the specific expression is:
Q^{(l,a)} = LN(F_roi + E_q) · W_q^{(l,a)}
where LN() denotes layer normalization, a = 1, ..., A with A the number of self-attention heads, l = 1, ..., L with L the number of Transformer encoder modules, W_q^{(l,a)} denotes a learnable projection matrix and E_q denotes a learnable encoding matrix;
(4.3) The Conv5 layer feature map F output in step (2.1) has its channel dimension reduced to 256 by a 1×1 convolution; the feature width-height matrix is then flattened into a one-dimensional vector, forming a feature map F_ctx of size d×T·H'W'; the subsequent processing is consistent with the query, giving the key K and the value V; the specific expressions are:
K^{(a)} = LN(F_ctx + E_kv) · W_k^{(a)},  V^{(a)} = LN(F_ctx + E_kv) · W_v^{(a)}
where W_k^{(a)} and W_v^{(a)} denote learnable projection matrices and E_kv denotes a learnable encoding matrix.
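A minimal PyTorch sketch of steps (4.1) to (4.3) follows; it is illustrative only. The use of torchvision's roi_align, the 7×7 RoI size, the number of context tokens and the way the k box features of each video are concatenated into one query sequence are assumptions.

```python
# Illustrative sketch of query/key/value construction (assumed sizes: d=256, 7x7 RoIs).
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class QKVBuilder(nn.Module):
    def __init__(self, d: int = 256, roi_size: int = 7, ctx_len: int = 16 * 7 * 7):
        super().__init__()
        self.roi_size = roi_size
        self.norm_q, self.norm_kv = nn.LayerNorm(d), nn.LayerNorm(d)
        self.proj_q, self.proj_k, self.proj_v = nn.Linear(d, d), nn.Linear(d, d), nn.Linear(d, d)
        self.pos_q = nn.Parameter(torch.zeros(1, roi_size * roi_size, d))   # learnable encodings
        self.pos_kv = nn.Parameter(torch.zeros(1, ctx_len, d))              # ctx_len = T*H'*W' (assumed)

    def forward(self, key_feat, ctx_tokens, boxes, spatial_scale):
        # key_feat: (B, d, H', W') key-frame map F'; ctx_tokens: (B, T*H'*W', d) surrounding frames
        # boxes: list of B tensors (k, 4), screened boxes in image coordinates (x1, y1, x2, y2)
        rois = roi_align(key_feat, boxes, output_size=self.roi_size,
                         spatial_scale=spatial_scale, aligned=True)          # (B*k, d, 7, 7)
        per_box = rois.flatten(2).transpose(1, 2) + self.pos_q               # (B*k, 49, d)
        q_tokens = per_box.reshape(key_feat.shape[0], -1, per_box.shape[-1]) # (B, k*49, d)
        q = self.proj_q(self.norm_q(q_tokens))                               # query Q
        kv = self.norm_kv(ctx_tokens + self.pos_kv)
        return q, self.proj_k(kv), self.proj_v(kv)                           # key K, value V
```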
Step five is: Q, K and V are sent into the group behavior recognition Transformer encoder module, and the updated query is computed by each head of each layer, where each head is a Transformer basic module; Q, K and V are sent to the two heads of the first layer, which output two coding matrices in parallel; the two output coding matrices are concatenated to obtain the updated query of the first layer, which serves as the input of the next layer; after the 3 Transformer coding layers, the group-level spatio-temporal coding feature map is output and finally sent to the multilayer perceptron layer to obtain the group behavior recognition prediction and its confidence; specifically:
(5.1) Self-attention calculation. With the Q and K obtained in step four, the self-attention weight matrix A^{(l,a)} of the a-th head of the l-th layer is computed by dot-product operation; the specific expression is:
A^{(l,a)} = SM(Q^{(l,a)} · (K^{(a)})^T / sqrt(D_h))
where SM() denotes the softmax activation function and D_h = D/A denotes the dimension of each attention head, with D the dimension of the key.
(5.2) Calculation of the coding feature map Q''^{(l,a)} of the a-th head of the l-th layer. As shown in fig. 5, the attention weight matrix obtained in step (5.1) and the value matrix obtained in step four are weighted and summed, the summation result is passed through a dropout layer and added to the original Q^{(l,a)} to obtain Q'^{(l,a)}; after layer normalization, Q'^{(l,a)} passes through two feedforward neural network layers and a dropout layer with a residual connection, and finally the updated matrix Q''^{(l,a)} is obtained through a normalization layer; this can be expressed by the following equations:
Q'^{(l,a)} = Dropout(A^{(l,a)} · V^{(a)}) + Q^{(l,a)}
Q''^{(l,a)} = LN(LN(Q'^{(l,a)}) + Dropout(FFN(FFN(LN(Q'^{(l,a)})))))
where FFN() denotes a feedforward neural network layer.
(5.3) Coding feature map calculation. After Q''^{(l,a)} is obtained in step (5.2), the outputs of all heads of the layer are concatenated to obtain the new Q^{(l+1)}; Q^{(l+1)} serves as the input of the (l+1)-th layer, and the query update is computed according to step (5.2), giving the final group-level spatio-temporal coding feature map G after the 3 Transformer coding layers.
(5.4) Group behavior category and its confidence. The group-level spatio-temporal coding feature map output in step (5.3) is sent into the multilayer perceptron layer to obtain the group behavior recognition prediction and its confidence, expressed as y = MLP(G), where y denotes the group behavior recognition prediction.
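A minimal PyTorch sketch of the group behavior recognition encoder described in step five follows; it is illustrative only. The linear projection used to merge the two concatenated head outputs back to dimension d and the mean pooling of the query tokens before the multilayer perceptron are assumptions not stated in the description; the per-head residual and feedforward structure follows steps (5.1) to (5.4).

```python
# Illustrative sketch of the 3-layer, 2-head group behavior recognition Transformer encoder.
import torch
import torch.nn as nn

class GroupHead(nn.Module):
    """One group behavior recognition Transformer basic module (one head)."""
    def __init__(self, d: int, drop: float = 0.1):
        super().__init__()
        self.norm_attn, self.norm_out = nn.LayerNorm(d), nn.LayerNorm(d)
        self.drop_attn, self.drop_ffn = nn.Dropout(drop), nn.Dropout(drop)
        self.ffn = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))  # two FFN layers

    def forward(self, q, k, v):
        attn = torch.softmax(q @ k.transpose(-2, -1) / (k.shape[-1] ** 0.5), dim=-1)
        q1 = self.drop_attn(attn @ v) + q                     # weighted values + residual query
        q1n = self.norm_attn(q1)                              # layer normalization
        return self.norm_out(q1n + self.drop_ffn(self.ffn(q1n)))  # FFN, residual, final norm

class GroupBehaviorEncoder(nn.Module):
    def __init__(self, d: int = 256, layers: int = 3, heads: int = 2, num_classes: int = 3):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.ModuleList([GroupHead(d) for _ in range(heads)]) for _ in range(layers)])
        self.merge = nn.ModuleList([nn.Linear(d * heads, d) for _ in range(layers)])  # assumed
        self.mlp = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, num_classes))

    def forward(self, q, k, v):
        # q: (B, Nq, d) query tokens; k, v: (B, Nk, d) context keys/values
        for heads, merge in zip(self.blocks, self.merge):
            q = merge(torch.cat([head(q, k, v) for head in heads], dim=-1))  # concat heads -> new Q
        group_feat = q.mean(dim=1)                            # pooled group-level feature (assumed)
        return self.mlp(group_feat)                           # group behavior logits
```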
Step six: construct the loss function and train the model. The whole network contains three losses: the binary classification loss L_cls, the regression loss L_box and the multi-classification loss L_group.
The binary classification loss L_cls is computed from the predictions output by the classification head constructed in step three and the matched target ground truth. Let y denote the set of target ground-truth values and ŷ the set of target predictions, where N denotes the number of target prediction outputs (N = 50 in this embodiment); since the number of prediction outputs is greater than the number of true targets in the image, the set y is padded with ∅, where ∅ denotes no target; the predictions and ground truths are matched with the Hungarian algorithm, and the loss between the matched predictions and ground truths is computed:
L_cls = Σ_{i=1}^{N} [ −log p̂_{σ̂(i)}(c_i) ]
where c_i denotes the real label value of the i-th target, σ̂(i) denotes the index of the prediction matched to the i-th ground truth by the Hungarian algorithm, and p̂_{σ̂(i)}(c_i) denotes the probability that the prediction matched to the i-th ground truth belongs to category c_i.
The regression loss L_box is computed from the target box position predictions output by the regression head constructed in step three and the matched target box ground truth:
L_box = Σ_{i=1}^{N} [ λ_iou · L_iou(b_i, b̂_{σ̂(i)}) + λ_L1 · ||b_i − b̂_{σ̂(i)}||_1 ]
L_iou(b_i, b̂_{σ̂(i)}) = 1 − ( Area(b_i ∩ b̂_{σ̂(i)}) / Area(b_i ∪ b̂_{σ̂(i)}) − Area(B(b_i, b̂_{σ̂(i)}) \ (b_i ∪ b̂_{σ̂(i)})) / Area(B(b_i, b̂_{σ̂(i)})) )
where b_i denotes the position of the i-th ground-truth regression box, b̂_{σ̂(i)} denotes the predicted regression box position matched to the i-th ground truth, Area() denotes the target box area, B(b_i, b̂_{σ̂(i)}) denotes the smallest box enclosing both boxes, and λ_iou and λ_L1 are hyper-parameters whose values are set in this embodiment.
calculating multi-classification loss according to the multi-classification head output predicted value and the real label value in the step (5.4)
Figure 835996DEST_PATH_IMAGE097
Figure 888265DEST_PATH_IMAGE098
Where K represents the number of categories of behavior,
Figure 465877DEST_PATH_IMAGE099
a real label representing the category of the behavior,
Figure 919992DEST_PATH_IMAGE100
indicates a predicted value of
Figure 823226DEST_PATH_IMAGE099
The probability of (c).
Each loss is weighted and summed, and the weight of each loss function is adjusted with the hyper-parameters α, β and γ to obtain the total loss L:
L = α·L_cls + β·1{c_i≠∅}·L_box + γ·L_group
where α, β and γ are weights, with α = 1, β = 1 and γ = 0.5 in this embodiment, and 1{c_i≠∅} is an indicator function that equals 1 when c_i ≠ ∅ and 0 otherwise.
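For illustration, the following PyTorch sketch combines the three losses as a weighted sum. It is not the patented implementation: the Hungarian matching itself is omitted (the matched prediction indices are taken as given), and the values λ_iou = 2 and λ_L1 = 5 are assumptions borrowed from common DETR practice rather than the values of this embodiment, which are not reproduced in the text above.

```python
# Illustrative sketch of the total loss L = alpha*L_cls + beta*L_box + gamma*L_group.
import torch
import torch.nn.functional as F
from torchvision.ops import generalized_box_iou, box_convert

def total_loss(pred_logits, pred_boxes, gt_classes, gt_boxes, matched,
               group_logits, group_label, alpha=1.0, beta=1.0, gamma=0.5,
               lambda_iou=2.0, lambda_l1=5.0):
    # pred_logits: (N, 2), pred_boxes: (N, 4) normalized (x_c, y_c, w, h)
    # gt_classes: (N,) with 0 = human body, 1 = background padding (no target)
    # matched: (M,) indices of predictions matched to the M real boxes (Hungarian matching assumed done)
    # group_logits: (B, K) multi-classification head output, group_label: (B,) behavior labels
    loss_cls = F.cross_entropy(pred_logits, gt_classes)                  # binary classification loss
    pb = box_convert(pred_boxes[matched], "cxcywh", "xyxy")
    gb = box_convert(gt_boxes, "cxcywh", "xyxy")
    loss_box = (lambda_l1 * F.l1_loss(pb, gb)                            # L1 term
                + lambda_iou * (1.0 - torch.diag(generalized_box_iou(pb, gb))).mean())  # GIoU term
    loss_group = F.cross_entropy(group_logits, group_label)              # multi-classification loss
    return alpha * loss_cls + beta * loss_box + gamma * loss_group
```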
The human body target detection Transformer in step three is initialized with the DETR model pre-trained on COCO, so that the network has prior knowledge and the situations where the loss is too large at the beginning of training and the model is difficult to converge are avoided. Since the backbone network in this embodiment uses a 3D ResNet50 while DETR uses a 2D ResNet50, the 2D ResNet50 parameter values are repeated T times in the time dimension for initialization, so that the weights learned for a single RGB image are extended to the T-frame input. The network is back-propagated based on the loss function, the network parameters are continuously updated by batch gradient descent, and the model converges after 100,000 batches of training.
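As a minimal sketch of this initialization (illustrative only): the text states that the 2D parameters are repeated T times along the time dimension; in the sketch below the repetition count is taken from the temporal kernel size of each 3D convolution so that the tensor shapes match, which is our interpretation rather than a detail given in the embodiment, and whether to additionally divide by the repetition count to preserve the activation scale is a design choice the text does not state.

```python
# Illustrative sketch: inflate pre-trained 2D ResNet50 weights into a 3D backbone.
import torch.nn as nn

def inflate_2d_weights(state_2d: dict, model_3d: nn.Module) -> None:
    """Copy 2D conv weights into matching 3D convs by repeating along the temporal kernel axis."""
    state_3d = model_3d.state_dict()
    for name, w in state_2d.items():
        if name not in state_3d:
            continue
        if w.dim() == 4 and state_3d[name].dim() == 5:          # Conv2d kernel -> Conv3d kernel
            t_k = state_3d[name].shape[2]                       # temporal kernel size of the 3D conv
            state_3d[name] = w.unsqueeze(2).repeat(1, 1, t_k, 1, 1)
        elif w.shape == state_3d[name].shape:                   # batch norm, fc, etc.
            state_3d[name] = w
    model_3d.load_state_dict(state_3d)
```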
The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention in any way. Although the foregoing has described the practice of the present invention in detail, it will be apparent to those skilled in the art that modifications may be made to the practice of the invention as described in the foregoing examples, or that certain features may be substituted in the practice of the invention. All changes, equivalents and modifications which come within the spirit and scope of the invention are desired to be protected.

Claims (10)

1. A video group behavior identification method based on a cascade Transformer is characterized by comprising the following steps:
step one: constructing a data set, namely the original video sequences, using the open-source fighting data set RWF-2000 together with surveillance video data collected from the network and collected and labeled locally;
step two: inputting the original video sequences obtained in step one into a backbone network, acquiring temporal and spatial feature maps at the Conv5 convolutional layer of the backbone network, selecting the video key frame image feature map, and preprocessing the key frame image feature map;
step three: transforming the scale of the key frame image feature map preprocessed in step two, inputting it into the human body target detection Transformer encoder, extracting image coding features through a self-attention mechanism, sending the image coding features and learnable query vectors into the human body target detection Transformer decoder to output target query vectors, and finally constructing a classification head and a regression head from a fully connected layer and multilayer perceptron layers, which respectively predict and output the target classification confidence and the target box position coordinates on the image;
step four: taking the key frame image feature map preprocessed in step two and the target category confidence and target box image coordinates output in step three as input, screening the human body target boxes by the target category confidence, mapping the sub-feature maps corresponding to the screened human body target boxes on the feature map, preprocessing the sub-feature maps to obtain the query, and linearly mapping the frame images around the key frame to obtain the key and the value;
step five: sending the query, key and value into the group behavior recognition Transformer encoder module, outputting the group-level spatio-temporal coding feature map, and outputting the group behavior recognition prediction and its confidence through the multilayer perceptron layer;
step six: constructing the loss function and training the network model.
2. The video group behavior identification method based on a cascade Transformer according to claim 1, wherein step one is specifically: using the open-source fighting data set RWF-2000 together with surveillance video data collected from the network and collected and labeled locally, the collected videos are cut into 5-second segments at a frame rate of 30 frames per second, video segments containing noise or blurred pictures are deleted, the top-left coordinates, width and height of the target box of the region where each human body is located in every video frame and the group behavior category are labeled, and data sets of the corresponding categories are constructed, where the behavior categories comprise fighting, gathering and running.
3. The video group behavior identification method based on a cascade Transformer according to claim 1, wherein step two specifically includes the following steps:
(2.1) a 3D ResNet50 deep residual network is selected as the backbone network, and the Conv2, Conv3, Conv4 and Conv5 layers of the 3D ResNet50 are downsampled only in the spatial dimension and not in the temporal dimension, giving the Conv5 layer feature map F ∈ R^{C×T×H'×W'}, where T denotes the number of frames;
(2.2) the middle frame of the temporal sequence is selected as the key frame image, giving the key frame image feature map F_key ∈ R^{C×H'×W'}; the channel dimension of the feature map is reduced with a 1×1 convolution, and the new feature map is denoted F' ∈ R^{d×H'×W'}; a position encoding matrix P ∈ R^{d×H'×W'} is introduced, using two-dimensional sinusoidal position encoding, and the new feature map matrix is F'' = F' + P.
4. The video group behavior identification method based on a cascade Transformer according to claim 3, wherein step three is specifically: the width-height matrix of the feature map F'' output in step (2.2) is flattened into a one-dimensional vector to form a new feature map, which is input into the human body target detection Transformer encoder; after the 6 encoder layers of the human body target detection Transformer encoder, the image-context-related coding feature map E is output; then a set of fixed learnable embedded target query vectors is set and input together with the feature map E into the human body target detection Transformer decoder; after the 6 decoder layers, the human body target detection Transformer decoder reasons about the relations between objects from the image-feature context and outputs N target query vectors, i.e. the number of target predictions, in parallel, which are sent to a classification head and a target box regression head, where the classification head consists of one fully connected layer and outputs the confidences of the two categories human body and background, and the target box regression head consists of one feedforward neural network layer and outputs the position coordinates of the target box on the image.
5. The video group behavior identification method based on a cascade Transformer according to claim 4, wherein the human body target detection Transformer encoder and decoder both adopt the encoder and decoder structure of the DEtection TRansformer (DETR); the encoder comprises M encoder layers, each consisting of 1 multi-head self-attention layer, 2 layer normalization layers and 1 feedforward neural network layer; the decoder comprises M decoder layers, each consisting of 2 multi-head self-attention layers, 3 layer normalization layers and 1 feedforward neural network layer.
6. The video group behavior identification method based on a cascade Transformer according to claim 4, wherein step four specifically includes the following steps:
(4.1) the human body target boxes output in step three are arranged in descending order of confidence, the first k boxes are selected and mapped with the RoIAlign algorithm onto the feature map F' output in step (2.2), giving the corresponding sub-feature maps;
(4.2) the width-height matrices of the sub-feature maps are flattened into one-dimensional vectors to form a new feature map, a learnable position encoding matrix is added, and after layer normalization a learnable projection is applied to obtain the query Q;
(4.3) the Conv5 layer feature map F output in step (2.1) has its channel dimension reduced by a 1×1 convolution, its feature width-height matrix is flattened into a one-dimensional vector to form a new feature map, and the subsequent processing is consistent with the query, giving the key K and the value V.
7. The video group behavior identification method based on a cascade Transformer according to claim 6, wherein step five is specifically: Q, K and V are sent into the group behavior recognition Transformer encoder module; the encoder module has 3 layers, each layer has two heads side by side, and each head is a group behavior recognition Transformer basic module; Q, K and V are sent to the two heads of the first layer, which output two coding matrices in parallel; the two output coding matrices are concatenated to obtain the updated query of the first layer, which serves as the input of the next layer; after the 3 Transformer coding layers, the group-level spatio-temporal coding feature map is output and finally sent to the multilayer perceptron layer to obtain the group behavior recognition prediction and its confidence.
8. The video group behavior identification method based on a cascade Transformer according to claim 7, wherein step five includes the following steps:
(5.1) using the Q and K output in step four, the self-attention weight matrix A^{(l,a)} of the a-th head in the l-th layer is computed by dot-product operation;
(5.2) the attention weight matrix obtained in step (5.1) and the value matrix obtained in step four are weighted and summed, the summation result is passed through a dropout layer and added to the original Q^{(l,a)} to obtain Q'^{(l,a)}; after layer normalization, Q'^{(l,a)} passes through two feedforward neural network layers and a dropout layer with a residual connection, and finally the updated matrix Q''^{(l,a)} is obtained through a normalization layer;
(5.3) the outputs Q''^{(l,a)} of all heads of the l-th layer obtained in step (5.2) are concatenated to obtain the new Q^{(l+1)}, which serves as the input of the (l+1)-th layer; the query update is computed iteratively according to step (5.2) until the final group-level spatio-temporal coding feature map is obtained after the three Transformer coding layers;
(5.4) the group-level spatio-temporal coding feature map output in step (5.3) is sent to the multilayer perceptron layer to obtain the group behavior recognition prediction and its confidence.
9. The video group behavior identification method based on a cascade Transformer according to claim 1, wherein the loss function includes a binary classification loss L_cls, a regression loss L_box and a multi-classification loss L_group; each loss is weighted and summed, and the weight of each loss function is adjusted with the hyper-parameters α, β and γ to obtain the total loss L:
L = α·L_cls + β·1{c_i≠∅}·L_box + γ·L_group
where α, β and γ are the weights and 1{c_i≠∅} is an indicator function that equals 1 when c_i ≠ ∅ and 0 otherwise; the binary classification loss L_cls is computed from the prediction output by the classification head and the matched target box ground truth, the regression loss L_box is computed from the target box position prediction of the regression head and the matched target box ground truth, and the multi-classification loss L_group is computed from the prediction output by the multilayer perceptron layer, i.e. the multi-classification head, and the ground-truth label.
10. The video group behavior identification method based on a cascade Transformer according to claim 1, wherein training the network model includes initializing the human body target detection Transformer; the human body target detection Transformer adopts the DEtection TRansformer (DETR), which uses a 2D ResNet50, and the 2D ResNet50 parameter values are repeated T times in the time dimension for initialization, so that the weights learned for a single RGB image are extended to the T-frame input; the network is back-propagated based on the loss function, the network parameters are continuously updated by batch gradient descent, and the model converges after batch training.
CN202111225547.8A 2021-10-21 2021-10-21 Video group behavior identification method based on cascade Transformer Active CN113673489B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111225547.8A CN113673489B (en) 2021-10-21 2021-10-21 Video group behavior identification method based on cascade Transformer

Publications (2)

Publication Number Publication Date
CN113673489A CN113673489A (en) 2021-11-19
CN113673489B true CN113673489B (en) 2022-04-08

Family

ID=78550756

Country Status (1)

Country Link
CN (1) CN113673489B (en)

Families Citing this family (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114153973A (en) * 2021-12-07 2022-03-08 内蒙古工业大学 Mongolian multi-mode emotion analysis method based on T-M BERT pre-training model
CN113888541B (en) * 2021-12-07 2022-03-25 南方医科大学南方医院 Image identification method, device and storage medium for laparoscopic surgery stage
CN114170558A (en) * 2021-12-14 2022-03-11 北京有竹居网络技术有限公司 Method, system, device, medium and article for video processing
CN113936339B (en) * 2021-12-16 2022-04-22 之江实验室 Fighting identification method and device based on double-channel cross attention mechanism
CN114339403B (en) * 2021-12-31 2023-03-28 西安交通大学 Video action fragment generation method, system, equipment and readable storage medium
CN114973049B (en) * 2022-01-05 2024-04-26 上海人工智能创新中心 Lightweight video classification method with unified convolution and self-attention
CN114898241B (en) * 2022-02-21 2024-04-30 上海科技大学 Video repetitive motion counting system based on computer vision
CN114863356B (en) * 2022-03-10 2023-02-03 西南交通大学 Group activity identification method and system based on residual aggregation graph network
CN114758360B (en) * 2022-04-24 2023-04-18 北京医准智能科技有限公司 Multi-modal image classification model training method and device and electronic equipment
CN114648723A (en) * 2022-04-28 2022-06-21 之江实验室 Action normative detection method and device based on time consistency comparison learning
CN115169673A (en) * 2022-07-01 2022-10-11 扬州大学 Intelligent campus epidemic risk monitoring and early warning system and method
CN114863352B (en) * 2022-07-07 2022-09-30 光谷技术有限公司 Personnel group behavior monitoring method based on video analysis
CN115171029B (en) * 2022-09-09 2022-12-30 山东省凯麟环保设备股份有限公司 Unmanned-driving-based method and system for segmenting instances in urban scene
CN115761444B (en) * 2022-11-24 2023-07-25 张栩铭 Training method of incomplete information target recognition model and target recognition method
CN116246338B (en) * 2022-12-20 2023-10-03 西南交通大学 Behavior recognition method based on graph convolution and transducer composite neural network
CN116402811B (en) * 2023-06-05 2023-08-18 长沙海信智能系统研究院有限公司 Fighting behavior identification method and electronic equipment
CN116542290B (en) * 2023-06-25 2023-09-08 城云科技(中国)有限公司 Information prediction model construction method, device and application based on multi-source multi-dimensional data
CN116958739A (en) * 2023-06-25 2023-10-27 南京矩视科技有限公司 Attention mechanism-based carbon fiber channel real-time dynamic numbering method
CN116978051A (en) * 2023-08-03 2023-10-31 杭州海量信息技术有限公司 Method and device for extracting key information of form image
CN116895038B (en) * 2023-09-11 2024-01-26 中移(苏州)软件技术有限公司 Video motion recognition method and device, electronic equipment and readable storage medium
CN117496323B (en) * 2023-12-27 2024-03-29 泰山学院 Multi-scale second-order pathological image classification method and system based on transducer

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110348312A (en) * 2019-06-14 2019-10-18 武汉大学 A kind of area video human action behavior real-time identification method
CN111460889A (en) * 2020-02-27 2020-07-28 平安科技(深圳)有限公司 Abnormal behavior identification method, device and equipment based on voice and image characteristics

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105426820B (en) * 2015-11-03 2018-09-21 中原智慧城市设计研究院有限公司 More people's anomaly detection methods based on safety monitoring video data
CN105574489B (en) * 2015-12-07 2019-01-11 上海交通大学 Based on the cascade violence group behavior detection method of level
WO2017168889A1 (en) * 2016-04-01 2017-10-05 Yamaha Hatsudoki Kabushiki Kaisha Object detection device and vehicle having the object detection device
CN108805080A (en) * 2018-06-12 2018-11-13 上海交通大学 Multi-level depth Recursive Networks group behavior recognition methods based on context
US10726302B2 (en) * 2018-11-29 2020-07-28 Qualcomm Incorporated Edge computing
CN112131943B (en) * 2020-08-20 2023-07-11 深圳大学 Dual-attention model-based video behavior recognition method and system
CN112149563A (en) * 2020-09-23 2020-12-29 中科人工智能创新技术研究院(青岛)有限公司 Method and system for estimating postures of key points of attention mechanism human body image
CN112861691B (en) * 2021-01-29 2022-09-09 中国科学技术大学 Pedestrian re-identification method under occlusion scene based on part perception modeling
CN113486708B (en) * 2021-05-24 2022-03-25 浙江大华技术股份有限公司 Human body posture estimation method, model training method, electronic device and storage medium

Also Published As

Publication number Publication date
CN113673489A (en) 2021-11-19

Similar Documents

Publication Publication Date Title
CN113673489B (en) Video group behavior identification method based on cascade Transformer
CN110119703B (en) Human body action recognition method fusing attention mechanism and spatio-temporal graph convolutional neural network in security scene
CN113936339B (en) Fighting identification method and device based on double-channel cross attention mechanism
CN106650653B (en) Construction method of human face recognition and age synthesis combined model based on deep learning
Zhou et al. Activity analysis, summarization, and visualization for indoor human activity monitoring
CN111310707A (en) Skeleton-based method and system for recognizing attention network actions
CN107818307B (en) Multi-label video event detection method based on LSTM network
CN113749657B (en) Brain electricity emotion recognition method based on multi-task capsule
Theodoridis et al. Cross-modal variational alignment of latent spaces
CN114973097A (en) Method, device, equipment and storage medium for recognizing abnormal behaviors in electric power machine room
CN112801068A (en) Video multi-target tracking and segmenting system and method
CN111723667A (en) Human body joint point coordinate-based intelligent lamp pole crowd behavior identification method and device
CN115578770A (en) Small sample facial expression recognition method and system based on self-supervision
CN117475216A (en) Hyperspectral and laser radar data fusion classification method based on AGLT network
CN113850182A (en) Action identification method based on DAMR-3 DNet
CN115798055B (en) Violent behavior detection method based on cornersort tracking algorithm
CN116402811B (en) Fighting behavior identification method and electronic equipment
CN115690917B (en) Pedestrian action identification method based on intelligent attention of appearance and motion
Zhao et al. Research on human behavior recognition in video based on 3DCCA
Verma et al. Intensifying security with smart video surveillance
CN112101095B (en) Suicide and violence tendency emotion recognition method based on language and limb characteristics
CN114782995A (en) Human interaction behavior detection method based on self-attention mechanism
CN114511732A (en) Citrus spotted disease and insect pest fine-grained image identification method
CN114120076A (en) Cross-view video gait recognition method based on gait motion estimation
Nayak et al. Learning a sparse dictionary of video structure for activity modeling

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant