CN109446923B - Deep supervision convolutional neural network behavior recognition method based on training feature fusion - Google Patents

Deep supervision convolutional neural network behavior recognition method based on training feature fusion

Info

Publication number
CN109446923B
CN109446923B (application CN201811176393.6A)
Authority
CN
China
Prior art keywords
video
local
layer
descriptor
evolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811176393.6A
Other languages
Chinese (zh)
Other versions
CN109446923A (en)
Inventor
李侃
李杨
王欣欣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN201811176393.6A priority Critical patent/CN109446923B/en
Publication of CN109446923A publication Critical patent/CN109446923A/en
Application granted granted Critical
Publication of CN109446923B publication Critical patent/CN109446923B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20: Movements or behaviour, e.g. gesture recognition
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/25: Fusion techniques
    • G06F18/253: Fusion techniques of extracted features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/40: Scenes; Scene-specific elements in video content
    • G06V20/41: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Multimedia (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a deep supervision convolutional neural network behavior recognition method based on training feature fusion, and belongs to the field of artificial intelligence and computer vision. The method extracts multilayer convolutional features of a target video, designs a local evolution rank pooling layer, and uses this layer to map the video convolutional features onto vectors containing temporal information, thereby extracting local evolution descriptors of the target video; a VLAD encoding method then encodes the local evolution descriptors into a meta-action based video-level representation; finally, exploiting the complementarity of information among the multiple levels of the convolutional network, the multi-level classification results are integrated into a final classification result. The method makes full use of temporal information when constructing the video-level representation and effectively improves the accuracy of video behavior recognition. At the same time, integrating the multi-level prediction results improves the discriminability of the intermediate layers of the network and thereby the overall performance of the network.

Description

Deep supervision convolutional neural network behavior recognition method based on training feature fusion
Technical Field
The invention relates to a video-based behavior recognition method, in particular to a deeply supervised convolutional neural network behavior recognition method based on training feature fusion, and belongs to the field of artificial intelligence and computer vision.
Background
At present, human behavior recognition is a research hotspot in the field of intelligent video analysis and an important research direction of video understanding. In recent years it has attracted wide attention in video surveillance, abnormal event monitoring, content-based video retrieval, and similar applications. However, because of the complexity and variability of human behavior and the interference of video background information, how to establish an appropriate spatio-temporal representation of a video becomes critical.
Early studies mainly focused on recognizing simple motions in ideal scenes and employed behavior recognition methods based on hand-crafted features, for example methods based on the 3D histogram of oriented gradients (HOG3D), the histogram of optical flow (HOF), and motion boundary histograms. These methods construct a video representation from region features centered on spatio-temporal interest points (STIP) and use it to recognize actions in the video.
With the rapid development of multimedia technology, data from networks and surveillance videos is growing rapidly, and human behavior recognition in real scenes is receiving more and more attention. Because of changes in human body shape, viewpoint, illumination and background, as well as camera motion, traditional behavior recognition methods based on hand-crafted features struggle to achieve ideal results in such real scenes.
In recent years, with the rapid development and application of deep learning in computer vision, a series of human behavior recognition methods based on deep models have been proposed: recognizing behaviors in video at the level of single frames, capturing motion information with a two-stream network that uses RGB frames and optical flow, learning spatio-temporal features of video segments with three-dimensional convolutional networks on the video stream, and so on. The later two-stream inflated three-dimensional convolutional network (I3D), which inflates the two-dimensional convolution and pooling kernels of a convolutional network into three dimensions, allows the network to seamlessly learn spatio-temporal features of video.
However, existing convolutional network structures can only model a single frame or a short segment of a video and lack the ability to directly model its long-range temporal structure. Existing behavior recognition methods based on deep models therefore adopt different strategies to obtain long-range spatio-temporal features of the video. These strategies fall into two main categories: (1) deep convolutional feature encoding and pooling methods, which extract the convolutional features of frames or video segments with a deep convolutional network and then construct a global video-level representation by spatio-temporal encoding or pooling; however, the video representation constructed in this way is unordered and ignores the temporal order and evolution relationships between video frames; (2) methods that construct the video-level representation by considering the temporal structure of the video, i.e., the deep features of a number of frames or video segments are fed into a temporal model such as an LSTM, a GRU or a ranking function and fused into a video-level representation; however, such methods may to some extent lack the spatial local information of the video.
Disclosure of Invention
The invention aims to overcome the above defects in the prior art and provides a deeply supervised convolutional neural network behavior recognition method based on training feature fusion. It addresses the problems of existing long-range video representation methods based on deep features and recognizes human behavior by establishing an appropriate spatio-temporal representation of the video.
The invention is realized by the following technical scheme.
A deep supervision convolutional neural network behavior recognition method based on training feature fusion comprises the following steps:
step 1: video data for training is collected to form a training data set.
The videos in the training data set are preprocessed: all video frames are extracted and cropped to the same size.
Step 2: the video in the training dataset is frame sampled.
Each video in the training data set is sampled uniformly: over the entire video span, T RGB frames [I_1, I_2, ..., I_T] are collected at a time interval of T_z/T, where T_z is the total duration of the video and I_t denotes the t-th sampled video frame, which corresponds to the t-th time instant.
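The sampling rule above can be written as a short helper. A minimal sketch, assuming the video is already decoded into a list of frames (the helper name and the rounding of fractional indices are illustrative assumptions):

```python
# Minimal sketch of the uniform frame sampling of step 2.
import numpy as np

def sample_uniform_frames(frames, T=10):
    """Pick T frames spaced evenly over the whole video span (interval T_z / T)."""
    idx = np.linspace(0, len(frames) - 1, num=T).round().astype(int)
    return [frames[i] for i in idx]
```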
Step 3: The training data set is augmented.
All video frames collected from each video are reversed in time to form a new video, which expands the training data set so that it contains twice as many videos as before.
Step 4: The multilayer convolutional features of the training video frames are extracted.
First, M convolutional layers are selected from a standard CNN (convolutional neural network) architecture for extracting the multilayer convolutional features of the video frames. Since recognizing behavior typically requires high-level semantic information, such as objects or body parts, the invention selects the M convolutional layers from the top convolutional layers of the network that generate feature maps.
Then, the T RGB frames [I_1, I_2, ..., I_T] collected from the video V are input into the convolutional network, and the feature maps generated at the M selected convolutional layers are extracted for each RGB frame. For each RGB frame, a feature map of spatial size N × N containing C channels is obtained at each selected convolutional layer. For the entire video V, M × T feature maps of spatial size N × N containing C channels are obtained.
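As a concrete illustration of this step, the sketch below registers forward hooks on M named layers of a generic PyTorch backbone and returns their feature maps for the T sampled frames; the backbone, the layer names, and the hook-based approach are assumptions for illustration, not the patent's prescribed implementation.

```python
# Hedged sketch of step 4: keep the feature maps of M selected top convolutional layers.
import torch.nn as nn

class MultiLayerFeatureExtractor(nn.Module):
    def __init__(self, backbone, layer_names):
        super().__init__()
        self.backbone = backbone
        self.layer_names = list(layer_names)     # the M selected convolutional layers
        self._maps = {}
        for name, module in backbone.named_modules():
            if name in self.layer_names:
                module.register_forward_hook(self._save(name))

    def _save(self, name):
        def hook(_module, _inputs, output):
            self._maps[name] = output            # feature map of shape (T, C, N, N)
        return hook

    def forward(self, frames):                   # frames: (T, 3, H, W) RGB frames of one video
        self._maps = {}
        self.backbone(frames)
        # one (T, C, N, N) tensor per selected layer -> M x T feature maps in total
        return [self._maps[n] for n in self.layer_names]
```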
Step 5: Feature aggregation is performed on the multilayer feature maps of the video frames to obtain video-level representations. The specific method is as follows:
Step 5.1: The local evolution descriptors of the video V are extracted using the local evolution rank pooling method.
The T feature maps obtained from the frames of the video V at the same convolutional layer are taken as input, the feature map of each frame is decomposed into a group of local spatial features, and the evolution information of the local spatial features at each spatial position is modeled to generate the local evolution descriptors. The specific method is as follows:
step 5.1.1: t frame [ I ] of video V via step 41,I2,...,IT]Each frame in (a) acquires a feature map of spatial size N × N and containing C channels at a selected convolutional layer, the feature map being represented as [ fm1,fm2,...,fmT]. The values of all channels at each spatial position on each feature map, T e { 1.,. T }, are connected separately, thereby decomposing each feature map into a plurality of local spatial features. For each frame, N × N C-dimensional local spatial features will be obtained.
Step 5.1.2: for T frame [ I1,I2,...,IT]The evolution information of each spatial position is modeled to generate a video V local evolution descriptor. The specific method comprises the following steps:
step 5.1.2.1: for a specific spatial position, the local spatial features of the T frames are represented as [ r ] in a time sequencei1,ri2,…,rit,...,riT]Where i ═ 1,. cndot.n },
Figure BDA0001823862180000031
is the local spatial feature of the ith spatial position at the t-th time,
Figure BDA0001823862180000032
real vector space in C dimension, i.e. ritIs a vector in the real vector space of dimension C.
Step 5.1.2.2: The evolution information of the i-th spatial position is modeled. A ranking (Rank) function is defined that computes a score for each time instant:

S(t, i | e) = e^T · d_it   (1)

where d_it = (1/t) Σ_{τ=1}^{t} r_iτ ∈ R^C is the average local spatial feature of the i-th spatial position over the first t time instants.
the invention sets a constraint relationship: the score value corresponding to the later moment being greater than the score value corresponding to the earlier moment, i.e.
Figure BDA0001823862180000041
The parameter e may reflect the temporal order of these local spatial features. Learning the parameter e can be considered as a convex optimization problem:
Figure BDA0001823862180000042
the first term of the objective function E (e) is the general quadratic regularization term, and the second term is the soft count loss function change-loss.
Step 5.1.2.3: The objective function E(e) is optimized, mapping the series of local spatial features onto a vector e*. e* contains the ordering information of the local spatial features and is the local evolution descriptor. The method uses an approximation technique to solve this optimization problem so that the operation can be embedded in the CNN. The solution of the objective function is finally simplified as:

e*_i = Σ_{t=1}^{T} α_t · r_it   (3)

where α_t = 2(T - t + 1) - (T + 1)(H_T - H_{t-1}) and H_t = Σ_{j=1}^{t} 1/j (with H_0 = 0) are the weights obtained by rank pooling (RankPooling). The above solution can be seen as a weighted sum of the local spatial features of the i-th spatial position over the T sampled time instants.
Step 5.1.2.4: A local evolution rank pooling layer is designed based on the above approximate solution of the ranking function. The layer takes the N × N × C convolutional feature maps of the T frames as input and outputs N × N local evolution descriptor vectors of dimension C, [e_1, e_2, ..., e_{N×N}].
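A minimal sketch of such a layer is given below, assuming PyTorch tensors of shape (T, C, N, N); it decomposes the feature maps into per-position local spatial features and applies the fixed weights α_t of equation (3). Treat it as an illustration of the approximate rank pooling idea rather than the patented layer itself.

```python
# Sketch of the local evolution rank pooling layer of steps 5.1.2.3-5.1.2.4.
import torch
import torch.nn as nn

class LocalEvolutionRankPooling(nn.Module):
    def __init__(self, T):
        super().__init__()
        H = torch.cumsum(1.0 / torch.arange(1, T + 1), dim=0)   # H_t = sum_{j<=t} 1/j
        H = torch.cat([torch.zeros(1), H])                      # prepend H_0 = 0
        t = torch.arange(1, T + 1, dtype=torch.float32)
        alpha = 2 * (T - t + 1) - (T + 1) * (H[T] - H[:T])      # alpha_t from equation (3)
        self.register_buffer("alpha", alpha)                    # (T,)

    def forward(self, fmaps):
        # fmaps: (T, C, N, N) feature maps of one video at one selected layer
        T, C, N, _ = fmaps.shape
        r = fmaps.permute(2, 3, 0, 1).reshape(N * N, T, C)      # local spatial features r_it
        e = (self.alpha.view(1, T, 1) * r).sum(dim=1)           # e_i = sum_t alpha_t r_it
        return e                                                # (N*N, C) local evolution descriptors
```

In this sketch the weights α_t are fixed, so the layer adds no trainable parameters and gradients flow straight through to the convolutional features below it.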
Step 5.2: The local evolution descriptors of the video are encoded into a meta-action based video-level representation using a VLAD (Vector of Locally Aggregated Descriptors) encoding method based on local evolution descriptors.
Based on the idea that an action is composed of a group of meta-actions, the method provides a VLAD encoding method based on local evolution descriptors. The specific steps are as follows:
step 5.2.1: using K meta-verb words, the feature space is transformed
Figure BDA0001823862180000045
Is divided into K units, and the anchor point of each unit is set as ak
Step 5.2.2: a series of local evolution descriptors [ e ] of the video V obtained in step 5.11,e2,...,eN×N]Is assigned to one of the K units and records a local evolution descriptor eiAnd anchor point akThe residual vector in between.
Step 5.2.3: The residual vectors are summed:

h_k = Σ_{i=1}^{N×N} ā_k(e_i) · (e_i - a_k)   (4)

In formula (4), ā_k(e_i) denotes the soft assignment of descriptor e_i to the k-th cell, and the anchor point a_k is in this formula a parameter that can be adjusted by training; e_i - a_k is the residual between the local evolution descriptor and the k-th anchor point. The h_k obtained by the formula is the aggregation descriptor in the k-th cell.
Step 5.2.4: After the sums of residuals between the local evolution descriptors of the video and each anchor point are obtained, the video V can be expressed as v = [h_1, h_2, ..., h_K] ∈ R^{C×K}, where C is the dimension of the real space and K is the number of meta-action cells, so v is a C × K matrix in the real space.
Since the above equation is differentiable and allows the error gradient to propagate back to the lower layers of the network, the invention designs a VLAD encoding layer based on local evolution descriptors.
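The sketch below illustrates one way such an encoding layer can be written in the spirit of NetVLAD; the softmax-based soft assignment ā_k(e_i) produced by a linear score function is an assumption chosen for the illustration, not the patent's exact formula.

```python
# Hedged sketch of the VLAD encoding layer of step 5.2 over local evolution descriptors.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalEvolutionVLAD(nn.Module):
    def __init__(self, C, K):
        super().__init__()
        self.anchors = nn.Parameter(torch.randn(K, C) * 0.01)   # meta-action anchors a_k
        self.assign = nn.Linear(C, K)                           # soft-assignment logits

    def forward(self, e):
        # e: (N*N, C) local evolution descriptors of one video at one layer
        a = F.softmax(self.assign(e), dim=1)                    # (N*N, K) soft assignments
        residual = e.unsqueeze(1) - self.anchors.unsqueeze(0)   # (N*N, K, C): e_i - a_k
        v = (a.unsqueeze(2) * residual).sum(dim=0)              # (K, C): h_k = sum_i a_ik (e_i - a_k)
        return v.t()                                            # (C, K) video-level representation
```

In NetVLAD-style layers the anchors and the assignment weights are trained jointly with the rest of the network, which is consistent with the statement above that the encoding layer lets gradients propagate back to the lower layers.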
Step 6: For the selected M convolutional layers, the operations of step 5 are performed on each layer in parallel to obtain the video-level feature representation of the video on each selected convolutional layer.
Recognizing actions from the video-level representations obtained at the multiple convolutional layers is the deeply supervised action recognition method.
Step 7: The video-level representation of each layer obtained in step 6 is input into the corresponding classifier to obtain the classification results of the video V on the M selected convolutional layers. The specific method is as follows:
step 7.1: to integrate all the parameters in the convolution and aggregation operations of the network, we define:
W = (W^(1), W^(2), ..., W^(B), w_a^(1), ..., w_a^(M))

w_c = (w_c^(1), w_c^(2), ..., w_c^(M))

where B is the total number of convolutional layers; for b ∈ {1, ..., B}, W^(b) denotes the parameters of the b-th convolutional layer. M is the number of selected convolutional layers; since a classification result is obtained on each selected convolutional layer, each selected layer is followed by a feature aggregation operation and a classifier, so there are M feature aggregation operations and M classifiers. For m ∈ {1, ..., M}, w_a^(m) denotes the weights of the feature aggregation operation on the m-th selected convolutional layer, and w_c^(m) denotes the weights of the classifier connected to the m-th selected convolutional layer.
Step 7.2: defining a loss function that merges all output layer classification errors:
ℓ(W, w_c) = Σ_{m=1}^{M} L(g, s_m)

where L denotes the video-level cross-entropy loss function of the action classification, defined as:

L(g, s_m) = - Σ_{i=1}^{Z} 1(g = A_i) · log P(g = A_i | s_m)

where g is the ground-truth label of the video V, g ∈ A, A = {A_1, ..., A_Z} defines all the action categories, Z is the number of categories, A_i denotes the i-th action category in the action set A, and s_m denotes the action class scores predicted from the m-th selected convolutional layer.
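As an illustration of this deeply supervised classification, the sketch below sums one cross-entropy term per selected layer; the linear classifiers and the tensor shapes are assumptions made for the example.

```python
# Hedged sketch of the loss of step 7.2: one classifier per selected layer, losses summed.
import torch.nn.functional as F

def deep_supervision_loss(video_reps, label, classifiers):
    """video_reps: list of M video-level tensors of shape (C, K); label: scalar class index."""
    total = 0.0
    for v, clf in zip(video_reps, classifiers):        # e.g. classifiers: nn.ModuleList of nn.Linear(C*K, Z)
        logits = clf(v.flatten().unsqueeze(0))         # (1, Z) class scores s_m
        total = total + F.cross_entropy(logits, label.view(1))
    return total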
Step 8: The classification results of the M selected convolutional layers are integrated.
The invention provides a classification integration method that fuses the multi-level prediction results: the scores obtained from each convolutional layer are summed with corresponding weights so as to fully exploit the complementarity of the multi-level information. The weights are assigned by an attention-based method. The specific method is as follows:
step 8.1: let the fused prediction result F be represented as:
F = Σ_{m=1}^{M} w_f^(m) ⊙ s_m

where w_f = (w_f^(1), ..., w_f^(M)) denotes the integration weights, each w_f^(m) ∈ R^Z is a Z-dimensional vector obtained by assigning weights with an attention mechanism, and s_m denotes the action class scores predicted from the m-th selected convolutional layer (⊙ denotes element-wise multiplication).
The loss function of the integration layer is defined as:

L_f(W, w_c, w_f) = - Σ_{i=1}^{Z} 1(g = A_i) · log P(y = A_i | V, W, w_c, w_f)

where y = argmax(F) denotes the finally predicted action type, and P(y = A_i | V, W, w_c, w_f) is the probability that the finally predicted action class is A_i.
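The following sketch shows one way the attention-style integration weights w_f^(m) could be realised; normalising them with a softmax over layers is an assumption made here so that the fused scores stay on a comparable scale.

```python
# Hedged sketch of the attention-based classification integration of step 8.1.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionFusion(nn.Module):
    def __init__(self, M, Z):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(M, Z))   # one weight vector per selected layer

    def forward(self, layer_scores):
        # layer_scores: (M, Z) class scores s_1 ... s_M from the M layer classifiers
        w = F.softmax(self.logits, dim=0)               # integration weights w_f^(m)
        return (w * layer_scores).sum(dim=0)            # fused prediction F of shape (Z,)
```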
Step 8.2: minimizing the following objective function on the training set, and learning to obtain all parameters W, Wc,wf
(W*, w_c*, w_f*) = argmin_{W, w_c, w_f} Σ_{V in training set} [ ℓ(W, w_c) + L_f(W, w_c, w_f) ]
Step 9: The loss function is optimized with a gradient descent algorithm, and the model parameters are adjusted through back propagation until the loss function converges. At this point, the deep convolutional neural network behavior recognition model based on trainable feature fusion has been trained.
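A compact, hypothetical training step combining the per-layer losses and the integration-layer loss under gradient descent might look as follows; the module and optimizer choices are illustrative and reuse the sketches above.

```python
# Hedged sketch of one optimization step for step 9.
import torch
import torch.nn.functional as F

def train_step(frames, label, extractor, rank_pool, vlads, classifiers, fusion, optimizer):
    feature_maps = extractor(frames)                              # M x (T, C, N, N)
    scores = []
    for fmap, vlad, clf in zip(feature_maps, vlads, classifiers):
        v = vlad(rank_pool(fmap))                                 # (C, K) video-level representation
        scores.append(clf(v.flatten().unsqueeze(0)).squeeze(0))   # (Z,) class scores s_m
    scores = torch.stack(scores)                                  # (M, Z)
    per_layer_loss = sum(F.cross_entropy(s.unsqueeze(0), label.view(1)) for s in scores)
    fused = fusion(scores)                                        # fused prediction F
    fusion_loss = F.cross_entropy(fused.unsqueeze(0), label.view(1))
    loss = per_layer_loss + fusion_loss                           # objective of step 8.2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```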
Step 10: The model trained in step 9 is used to recognize the human behaviors in an unknown video V'. The specific steps are as follows:
Step 10.1: The unknown video V' is preprocessed and frame-sampled according to the methods of steps 1 and 2, yielding T RGB frames [I'_1, I'_2, ..., I'_T].
Step 10.2: The multilayer convolutional features of the unknown video are extracted according to the method of step 4. For each RGB frame of V', a feature map of spatial size N × N containing C channels is obtained at each selected convolutional layer. For the entire unknown video V', M × T feature maps of spatial size N × N containing C channels are obtained.
Step 10.3: According to the methods of steps 5 and 6, the video-level feature representation of V' on each of the M selected convolutional layers is obtained. The specific steps are as follows:
First, following step 5.1, the local evolution rank pooling method yields N × N C-dimensional local evolution descriptor vectors [e'_1, e'_2, ..., e'_{N×N}] on each selected convolutional layer of V'.
Then, following step 5.2, the VLAD encoding based on local evolution descriptors encodes [e'_1, e'_2, ..., e'_{N×N}] into a meta-action based video-level representation v' = [h'_1, h'_2, ..., h'_K] ∈ R^{C×K}.
Finally, following step 6, the above operations are performed in parallel on the M selected convolutional layers, yielding a video-level representation of V' on each layer.
Step 10.4: According to the method of step 7, the classification results of V' on the M selected convolutional layers are obtained; s'_m denotes the action class predicted for V' on the m-th selected convolutional layer. According to the method of step 8, the multi-layer classification results are integrated by the classification integration method to obtain the final classification result of the unknown video. F' denotes the fused prediction:

F' = Σ_{m=1}^{M} w_f^(m) ⊙ s'_m

where w_f^(m) ∈ R^Z is a Z-dimensional vector and s'_m is the action class prediction of the m-th selected convolutional layer.
After the above process is completed, the predicted behavior of the persons in the unknown video is obtained.
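For completeness, a hypothetical inference routine for an unknown video, reusing the building blocks sketched above (all module names are placeholders, not the patent's implementation):

```python
# Hedged sketch of the recognition of an unknown video V' in step 10.
import torch

@torch.no_grad()
def predict(frames, extractor, rank_pool, vlads, classifiers, fusion, class_names):
    feature_maps = extractor(frames)                              # step 10.2
    scores = []
    for fmap, vlad, clf in zip(feature_maps, vlads, classifiers): # steps 10.3-10.4
        v = vlad(rank_pool(fmap))                                 # video-level representation on this layer
        scores.append(clf(v.flatten().unsqueeze(0)).squeeze(0))   # (Z,) class scores s'_m
    fused = fusion(torch.stack(scores))                           # F' = sum_m w_f^(m) * s'_m
    return class_names[fused.argmax().item()]                     # e.g. "running"
```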
Advantageous effects
Compared with the prior art, the invention has the following beneficial effects:
(1) the proposed feature aggregation operation combines the local evolution rank pooling operation and the VLAD encoding operation based on local evolution descriptors into a whole, implementing them as a local evolution rank pooling layer and a VLAD encoding layer based on local evolution descriptors, thereby simplifying the implementation of the method;
(2) the proposed local evolution rank pooling method captures more details about the action by modeling the temporal evolution information of each spatial position;
(3) the VLAD encoding based on local evolution descriptors generates a more discriminative video representation by projecting the local evolution descriptors into a semantic space;
(4) the proposed deeply supervised action recognition method constructs multi-level video representations in a single network and generates multiple prediction results;
(5) the proposed integration of multi-level classification results improves the discriminability of the intermediate layers of the network, thereby improving the overall performance of the network.
Drawings
FIG. 1 is a block diagram of the overall logic of the present invention.
FIG. 2 shows the steps of the method of the present invention and the propagation of parameters. The method comprises a model training step, a feature aggregation method and a deep supervision action recognition method.
FIG. 3 is a flow chart of the method of the present invention.
Detailed Description
The following will explain the embodiments of the present invention in further detail with reference to the accompanying drawings.
The invention is executed on a computer and consists of three main functions. First, multilayer convolutional feature extraction, which extracts a multilayer feature map for each frame of a video. Second, feature aggregation, which includes a local evolution rank pooling layer that encodes the multi-frame feature maps obtained at each layer into local evolution descriptors, and a VLAD encoding layer based on local evolution descriptors that encodes the local evolution descriptors into a meta-action based video-level representation. Third, the deeply supervised action recognition method, which recognizes the actions of persons in the video using the obtained multi-layer video-level representations and integrates the multi-layer classification results into a final prediction. The overall logical structure of the invention is shown in FIG. 1.
Fig. 3 is a flowchart of a deep supervised convolutional neural network behavior recognition method based on trainable feature fusion according to the present invention.
The following describes in more detail a specific embodiment of a deep supervised convolutional neural network behavior recognition method based on trainable feature fusion according to the present invention.
According to the flow chart of the model training phase shown in (b) of fig. 3, the specific implementation method of the model training phase is as follows:
step 1: the video in the training video dataset is pre-processed, all video frames are extracted and cropped to a size of 224px by 224 px.
Step 2: For each of the training videos, 10 RGB frames [I_1, I_2, ..., I_10] are collected uniformly at a time interval of T_z/10, where T_z is the total duration of the video and I_t denotes the t-th collected frame of the video; for convenience, the t-th frame of a training video corresponds to its t-th time instant.
Step 3: The video frames collected from each video in the data set are reversed to form new videos so as to expand the training data set; the video data set then contains twice as many videos as the previous training data set.
Step 4: To extract the multilayer convolutional features of the training video frames, the invention selects 3 convolutional layers of the pre-trained CNN architecture, namely the Mixed5_a, Mixed5_b and Mixed5_c layers, to generate the feature maps of the video frames. The 10 RGB frames [I_1, I_2, ..., I_10] collected from the video V are input into the convolutional network; for each RGB frame, a feature map of spatial size 64 × 64 containing 3 channels is obtained at each selected convolutional layer. For the entire video V, 3 × 10 feature maps of spatial size 64 × 64 containing 3 channels are obtained.
Step 5: Feature aggregation is performed on the multilayer feature maps of the video frames to obtain video-level representations. The specific method is as follows:
Step 5.1: The RGB frames collected from each training video are input into the local evolution rank pooling layer to obtain the local evolution descriptors of each training video.
Step 5.1.1: Via step 4, each of the 10 frames [I_1, I_2, ..., I_10] of the training video V yields, at the Mixed5_a layer, a feature map of spatial size 64 × 64 containing 3 channels; the feature maps can be represented as [fm_1, fm_2, ..., fm_10]. For each fm_t, t ∈ {1, ..., 10}, the values of all channels at each spatial position are concatenated, so that fm_t is decomposed into 64 × 64 3-dimensional local spatial features.
Step 5.1.2: For the 10 frames [I_1, I_2, ..., I_10], the evolution information of each spatial position is modeled to generate the local evolution descriptors of the video V. The specific method is as follows:
step 5.1.2.1, the local spatial features of a particular spatial location i are sorted in time order to obtain the representation ri1,ri2,…rit,…,ri10]Where i ═ 1, ·,64},
Figure BDA0001823862180000101
is the local spatial feature of the ith spatial position at the t-th time,
Figure BDA0001823862180000102
real vector space in 3 dimensions, i.e. ritIs a vector in the 3-dimensional real vector space.
Step 5.1.2.2: Using the ranking function S(t, i | e) = e^T · d_it, a score is computed for each time instant t, where d_it = (1/t) Σ_{τ=1}^{t} r_iτ ∈ R^3 is the average local spatial feature of the i-th spatial position over the first t time instants; the time instants correspond to t = 1, ..., 10. If q ∈ {1, ..., 10} is a time instant later than t ∈ {1, ..., 10}, then S(q, i | e) > S(t, i | e). All pairs satisfying q > t are found and E(e) is computed:

E(e) = (λ/2) ‖e‖² + (2 / (10 × 9)) Σ_{q>t} max{0, 1 - S(q, i | e) + S(t, i | e)}
step 5.1.2.3, optimize E (e), map a series of local spatial features to a vector e。eNamely, the local evolution descriptor of the training video:
e=argmineE(e)
using an approximation technique to simplify the solution of e (e) as:
Figure BDA0001823862180000106
wherein alpha ist=2(10-t+1)-(10+1)(H10-Ht-1),
Figure BDA0001823862180000107
The weights are obtained by rank pooling (RankPooling). The above solution can be seen as a weighted addition of the local spatial features of the ith spatial position at all acquired 10 time instants.
Step 5.1.2.4, the learned e-vector is the local evolution descriptor of the ith spatial position of the training video, the whole training video is input, and 64 × 64 3-dimensional local evolution descriptor vectors [ e ] are obtained at Mixed5_ a layer1,e2,...,e64×64]。
Step 5.2: The local evolution descriptor vectors of each training video are input into the VLAD encoding layer based on local evolution descriptors to obtain the video-level representation of each training video.
Step 5.2.1: Using 32 meta-action words, the feature space R^3 is divided into 32 cells, and each of the local evolution descriptors [e_1, e_2, ..., e_{64×64}] is assigned to one of the 32 cells. The residual vector (e_i - a_k) between the local evolution descriptor e_i and each meta-action anchor point a_k is recorded.
Step 5.2.2: These residual vectors are summed to obtain the aggregation descriptor h_k in the k-th cell:

h_k = Σ_{i=1}^{64×64} ā_k(e_i) · (e_i - a_k)

Step 5.2.3: The training video can then be represented as v = [h_1, h_2, ..., h_32] ∈ R^{3×32}, i.e. v is a 3 × 32 matrix in the real space.
Step 6: The operations of step 5 above are performed in parallel at the Mixed5_a, Mixed5_b and Mixed5_c layers, resulting in a video-level representation of each training video on these 3 convolutional layers.
Step 7: The classification results of the training video on the multiple convolutional layers are obtained.
The video-level representation of each layer obtained in step 6 is input into the corresponding classifier to obtain the classification result of that convolutional layer. The specific method is as follows:
step 7.1, define parameters, total number of convolution layers of the whole network is B, and the parameter of the B-th convolution layer is expressed as
Figure BDA0001823862180000113
The selected convolutional layers are Mixed5_ a Mixed5_ b layer and Mixed5_ c layer 3, and each selected convolutional layer is connected with a feature aggregation operation and a classifier because a classification result is obtained on each selected convolutional layer, so the number of the feature aggregation operations is 3, and the number of the classifiers is also 3. Then the weight of the feature aggregation operation on the mth selected convolutional layer is
Figure BDA0001823862180000114
The weight of the classifier connected to the mth selected convolutional layer is
Figure BDA0001823862180000115
Figure BDA0001823862180000116
Figure BDA0001823862180000117
Step 7.2: The loss function merging all output-layer classification errors is defined as:

ℓ(W, w_c) = Σ_{m=1}^{3} L(g, s_m)

where L is the video-level cross-entropy loss of the action classification. Let A = {A_1, ..., A_51} define all action classes in the training data set, with 51 classes in total. The ground-truth label of the training video is g ∈ A, and s_m is the action class prediction of the m-th selected convolutional layer. The cross-entropy loss is then:

L(g, s_m) = - Σ_{i=1}^{51} 1(g = A_i) · log P(g = A_i | s_m)
and 8: and integrating the classification results of multiple layers.
Step 8.1: The integrated prediction result is:

F = Σ_{m=1}^{3} w_f^(m) ⊙ s_m

where w_f = (w_f^(1), w_f^(2), w_f^(3)) denotes the integration weights and each w_f^(m) is a Z-dimensional vector obtained by assigning weights with the attention mechanism. The loss function of the integration layer is defined as:

L_f(W, w_c, w_f) = - Σ_{i=1}^{Z} 1(g = A_i) · log P(y = A_i | V, W, w_c, w_f)

where y = argmax(F) denotes the finally predicted action type and P(y = A_i | V, W, w_c, w_f) is the probability that the finally predicted action class is A_i.
Step 8.2: The following objective function is minimized to learn all the parameters W, w_c, w_f:

(W*, w_c*, w_f*) = argmin_{W, w_c, w_f} Σ_{training videos} [ ℓ(W, w_c) + L_f(W, w_c, w_f) ]
Step 9: The loss function is optimized with a gradient descent algorithm, and the model parameters are adjusted through back propagation until the loss function converges; the deep convolutional neural network behavior recognition model based on trainable feature fusion is then trained.
Step 10: The model trained in step 9 is used to recognize the human behaviors in an unknown video V'. The specific steps are as follows:
and step 10.1, preprocessing and frame sampling are carried out on the input unknown video according to the step 1 and the step 2, all video frames of the unknown video are extracted and cut into the size of 224px multiplied by 224 px. At time intervals of
Figure BDA0001823862180000128
Uniformly acquiring 10 RGB frames [ I'1,I′1,...,I′10]0.4s is the total duration of unknown video, I'tIndicating that the t-th acquired video frame of a certain video.
Step 10.2: The multilayer convolutional features of the unknown video are extracted according to the method of step 4; for each RGB frame of V', a feature map of spatial size 64 × 64 containing 3 channels is obtained at each selected convolutional layer. For the entire unknown video V', 3 × 10 feature maps of spatial size 64 × 64 containing 3 channels are obtained.
Step 10.3: According to the methods of steps 5 and 6, the video-level feature representation of V' on each of the 3 selected convolutional layers is obtained. The specific steps are as follows:
First, following step 5.1, the local evolution rank pooling method yields 64 × 64 3-dimensional local evolution descriptor vectors [e'_1, e'_2, ..., e'_{64×64}] on each selected convolutional layer of V'.
Then, following step 5.2, the VLAD encoding based on local evolution descriptors encodes [e'_1, e'_2, ..., e'_{64×64}] into a meta-action based video-level representation v' = [h'_1, h'_2, ..., h'_32], where v' is a 3 × 32 matrix in the real space.
Finally, following the method of step 6, the above operations are performed in parallel on the 3 selected convolutional layers Mixed5_a, Mixed5_b and Mixed5_c, resulting in a video-level representation of V' on each layer.
Step 10.4: The classification results of V' on the 3 selected convolutional layers are obtained according to the method of step 7; s'_m denotes the action class predicted for the unknown video V' on the m-th selected convolutional layer. According to the method of step 8, the multi-layer classification results are integrated by the classification integration method to obtain the final classification result of the unknown video:

F' = Σ_{m=1}^{3} w_f^(m) ⊙ s'_m

where w_f^(m) denotes the integration weights.
After the above process is completed, the predicted behavior of the person in the unknown video is obtained, in this example 'running'.
The above detailed description is intended to illustrate the objects, aspects and advantages of the present invention, and it should be understood that the above detailed description is only exemplary of the present invention and is not intended to limit the scope of the present invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (6)

1. The deep supervision convolutional neural network behavior recognition method based on training feature fusion is characterized by comprising the following steps of:
step 1: collecting video data for training to form a training data set;
step 2: performing uniform frame sampling on each video in the training data set;
step 3: expanding the training data set: all the video frames collected from each video are reversed to form new videos, so that the training data set is expanded and the number of videos in the video data set is 2 times that in the previous video data set;
step 4: extracting multilayer convolutional features of the training video frames;
firstly, selecting M convolutional layers from a standard convolutional neural network architecture for extracting multilayer convolutional characteristics of a video frame;
then, the T RGB frames [I_1, I_2, ..., I_T] collected from the video V are input into the convolutional network, and the feature maps generated at the M selected convolutional layers are extracted for each RGB frame; for each RGB frame, a feature map of spatial size N × N containing C channels is obtained at each selected convolutional layer; for the entire video V, M × T feature maps of spatial size N × N containing C channels are obtained;
step 5: performing feature aggregation on the multilayer feature maps of the video frames to obtain a video-level representation, the specific method being as follows:
step 5.1: extracting the local evolution descriptors of the video V by using a local evolution rank pooling method:
firstly, taking T feature maps obtained by a plurality of frames of a video V under the same convolutional layer as input, decomposing the feature maps of each frame into a group of local spatial features, and finally modeling evolution information of the local spatial features of each spatial position to generate a local evolution descriptor;
step 5.2: encoding a local evolution descriptor of a video into a meta-action based representation of the video level using a local aggregation vector encoding method based on the local evolution descriptor;
step 6: for the selected M convolutional layers, performing the operations of step 5 on each layer in parallel to obtain the video-level feature representation of the video on each selected convolutional layer;
step 7: inputting the video-level representation of each layer obtained in step 6 into a corresponding classifier to obtain the classification results of the video V on the M selected convolutional layers;
step 8: integrating the classification results of the M selected convolutional layers, the specific method being as follows:
step 8.1: let the fused prediction result F be represented as:

F = Σ_{m=1}^{M} w_f^(m) ⊙ s_m   (1)

wherein w_f = (w_f^(1), ..., w_f^(M)) denotes the integration weights, each w_f^(m) ∈ R^Z being a Z-dimensional vector obtained by assigning weights by means of attention, and s_m denotes the action class scores predicted from the m-th selected convolutional layer (⊙ denotes element-wise multiplication);
the loss function of the integration layer is defined as:

L_f(W, w_c, w_f) = - Σ_{i=1}^{Z} 1(g = A_i) · log P(y = A_i | V, W, w_c, w_f)   (2)

wherein y denotes the finally predicted action type, and P(y = A_i | V, W, w_c, w_f) is the probability that the finally predicted action class is A_i;
step 8.2: minimizing the following objective function on the training set and learning all the parameters W, w_c, w_f:

(W*, w_c*, w_f*) = argmin_{W, w_c, w_f} Σ_{V in training set} [ ℓ(W, w_c) + L_f(W, w_c, w_f) ]   (3)
And step 9: optimizing the loss function by using a gradient descent algorithm, and adjusting model parameters through back propagation until the loss function is converged;
step 10: recognizing the human behaviors in an unknown video V' by using the model trained in step 9, the specific steps being as follows:
step 10.1: preprocessing the unknown video V' according to the methods of step 1 and step 2 and performing frame sampling to obtain T RGB frames [I'_1, I'_2, ..., I'_T];
Step 10.2: extracting multilayer convolution characteristics of the unknown video according to the method in the step 4; for each RGB frame of V', obtaining a feature map with a spatial size of N × N and containing C channels at each selected convolution layer; for the whole unknown video V', obtaining M × T feature maps with the space size of N × N and containing C channels;
step 10.3: according to the methods of step 5 and step 6, obtaining the video-level feature representation of V' on each of the M selected convolutional layers; the specific steps are as follows:
first, following step 5.1, using the local evolution rank pooling method to obtain N × N C-dimensional local evolution descriptor vectors [e'_1, e'_2, ..., e'_{N×N}] on each selected convolutional layer of V';
then, following step 5.2, using the VLAD encoding based on local evolution descriptors to encode [e'_1, e'_2, ..., e'_{N×N}] into a meta-action based video-level representation v' = [h'_1, h'_2, ..., h'_K] ∈ R^{C×K};
finally, following the method of step 6, performing the above operations in parallel on the M selected convolutional layers to obtain a video-level representation of V' on each layer;
step 10.4: obtaining the classification results of V' on the M selected convolutional layers according to the method of step 7, s'_m denoting the action class predicted for V' on the m-th selected convolutional layer; according to the method of step 8, integrating the multi-layer classification results by the classification integration method to obtain the final classification result of the unknown video; F' denotes the fused prediction:

F' = Σ_{m=1}^{M} w_f^(m) ⊙ s'_m   (4)

wherein w_f^(m) ∈ R^Z is a Z-dimensional vector and s'_m denotes the action class prediction of the m-th selected convolutional layer.
2. The deep supervised convolutional neural network behavior recognition method based on training feature fusion as claimed in claim 1, wherein the method for performing uniform frame sampling in step 2 is as follows:
over the entire video span, T RGB frames [I_1, I_2, ..., I_T] are collected uniformly at a time interval of T_z/T, wherein T_z is the total duration of the video and I_t denotes the t-th collected video frame, which corresponds to the t-th time instant.
3. The deep supervised convolutional neural network behavior recognition method based on training feature fusion as claimed in claim 1, wherein the specific implementation method of step 5.1 is as follows:
step 5.1.1: via step 4, each of the T frames [I_1, I_2, ..., I_T] of the video V yields, at a selected convolutional layer, a feature map of spatial size N × N containing C channels, the feature maps being denoted [fm_1, fm_2, ..., fm_T];
the values of all channels at each spatial position on each feature map, t ∈ {1, ..., T}, are concatenated, thereby decomposing each feature map into a plurality of local spatial features;
for each frame, N × N C-dimensional local spatial features are obtained;
step 5.1.2: for the T frames [I_1, I_2, ..., I_T], the evolution information of each spatial position is modeled to generate the local evolution descriptors of the video V.
4. The deep supervised convolutional neural network behavior recognition method based on training feature fusion as claimed in claim 3, wherein the specific implementation method of step 5.1.2 is as follows:
step 5.1.2.1: for a specific spatial position, the local spatial features of the T frames are arranged in temporal order as [r_i1, r_i2, ..., r_it, ..., r_iT], where i ∈ {1, ..., N × N} and r_it ∈ R^C is the local spatial feature of the i-th spatial position at the t-th time instant, R^C being the C-dimensional real vector space, i.e. r_it is a vector in the C-dimensional real vector space;
step 5.1.2.2: modeling the evolution information of the i-th spatial position; defining a ranking function that computes a score for each time instant:

S(t, i | e) = e^T · d_it   (5)

wherein d_it = (1/t) Σ_{τ=1}^{t} r_iτ ∈ R^C is the average local spatial feature of the i-th spatial position over the first t time instants;
setting a constraint relationship: the score corresponding to a later time instant is greater than the score corresponding to an earlier time instant, i.e. ∀ q > t: S(q, i | e) > S(t, i | e); the parameter e reflects the temporal order of these local spatial features; learning the parameter e is considered a convex optimization problem:

E(e) = (λ/2) ‖e‖² + (2 / (T(T - 1))) Σ_{q>t} max{0, 1 - S(q, i | e) + S(t, i | e)}   (6)

the first term of the objective function E(e) is a general quadratic regularization term, and the second term is a hinge loss that softly counts ranking violations;
step 5.1.2.3: optimizing the objective function E(e) and mapping the series of local spatial features onto a vector e*; e* contains the ordering information of the local spatial features and is the local evolution descriptor; the solution of the above objective function is simplified as:

e*_i = Σ_{t=1}^{T} α_t · r_it   (7)

wherein α_t = 2(T - t + 1) - (T + 1)(H_T - H_{t-1}) and H_t = Σ_{j=1}^{t} 1/j (with H_0 = 0) are the weights obtained by rank pooling; the solution is regarded as a weighted sum of the local spatial features of the i-th spatial position over the T collected time instants;
step 5.1.2.4: designing a local evolution rank pooling layer based on the approximate solution of the ranking function; the layer takes the N × N × C convolutional feature maps of the T frames as input and outputs N × N local evolution descriptor vectors of dimension C, [e_1, e_2, ..., e_{N×N}].
5. The deep supervised convolutional neural network behavior recognition method based on training feature fusion as claimed in claim 1, wherein the specific implementation method of step 5.2 is as follows:
step 5.2.1: using K meta-action words, dividing the feature space R^C into K cells, the anchor point of each cell being denoted a_k;
step 5.2.2: assigning each of the local evolution descriptors [e_1, e_2, ..., e_{N×N}] of the video V obtained in step 5.1 to one of the K cells divided in step 5.2.1, and recording the residual vector between the local evolution descriptor e_i and the anchor point a_k;
step 5.2.3: summing the residual vectors:

h_k = Σ_{i=1}^{N×N} ā_k(e_i) · (e_i - a_k)   (8)

in formula (8), ā_k(e_i) denotes the soft assignment of the descriptor e_i, and the anchor point a_k is in this formula a hyper-parameter adjusted by training; e_i - a_k denotes the residual between the local evolution descriptor and the k-th anchor point; the h_k obtained by the formula denotes the aggregation descriptor in the k-th cell;
step 5.2.4: obtaining the sums of residuals between the local evolution descriptors of the video and each anchor point, the video V being represented as v = [h_1, h_2, ..., h_K] ∈ R^{C×K}, wherein C is the dimension of the real space and K is the number of meta-action cells; v is a C × K matrix over the real space.
6. The deep supervised convolutional neural network behavior recognition method based on training feature fusion as claimed in claim 1, wherein the specific implementation method of step 7 is as follows:
step 7.1: defining:

W = (W^(1), W^(2), ..., W^(B), w_a^(1), ..., w_a^(M))   (9)

w_c = (w_c^(1), w_c^(2), ..., w_c^(M))   (10)

wherein B denotes the total number of convolutional layers; for b ∈ {1, ..., B}, W^(b) denotes the parameters of the b-th convolutional layer; M denotes the number of selected convolutional layers; for m ∈ {1, ..., M}, w_a^(m) denotes the weights of the feature aggregation operation on the m-th selected convolutional layer, and w_c^(m) denotes the weights of the classifier connected to the m-th selected convolutional layer;
step 7.2: defining a loss function that merges all output-layer classification errors:

ℓ(W, w_c) = Σ_{m=1}^{M} L(g, s_m)   (11)

wherein L denotes the video-level cross-entropy loss function of the action classification, defined as:

L(g, s_m) = - Σ_{i=1}^{Z} 1(g = A_i) · log P(g = A_i | s_m)   (12)

wherein g is the ground-truth label of the video V, g ∈ A, A = {A_1, ..., A_Z} defines all the action categories, the number of categories is Z, A_i denotes the i-th action category in the action set A, and s_m denotes the action class predicted by the m-th selected convolutional layer.
CN201811176393.6A 2018-10-10 2018-10-10 Deep supervision convolutional neural network behavior recognition method based on training feature fusion Active CN109446923B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811176393.6A CN109446923B (en) 2018-10-10 2018-10-10 Deep supervision convolutional neural network behavior recognition method based on training feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811176393.6A CN109446923B (en) 2018-10-10 2018-10-10 Deep supervision convolutional neural network behavior recognition method based on training feature fusion

Publications (2)

Publication Number Publication Date
CN109446923A CN109446923A (en) 2019-03-08
CN109446923B true CN109446923B (en) 2021-09-24

Family

ID=65546295

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811176393.6A Active CN109446923B (en) 2018-10-10 2018-10-10 Deep supervision convolutional neural network behavior recognition method based on training feature fusion

Country Status (1)

Country Link
CN (1) CN109446923B (en)

Families Citing this family (36)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110084151B (en) * 2019-04-10 2023-02-28 东南大学 Video abnormal behavior discrimination method based on non-local network deep learning
CN110097000A (en) * 2019-04-29 2019-08-06 东南大学 Video behavior recognition methods based on local feature Aggregation Descriptor and sequential relationship network
CN110188635B (en) * 2019-05-16 2021-04-30 南开大学 Plant disease and insect pest identification method based on attention mechanism and multi-level convolution characteristics
CN110119749A (en) * 2019-05-16 2019-08-13 北京小米智能科技有限公司 Identify method and apparatus, the storage medium of product image
CN110490035A (en) * 2019-05-17 2019-11-22 上海交通大学 Human skeleton action identification method, system and medium
CN110334589B (en) * 2019-05-23 2021-05-14 中国地质大学(武汉) High-time-sequence 3D neural network action identification method based on hole convolution
CN110135386B (en) * 2019-05-24 2021-09-03 长沙学院 Human body action recognition method and system based on deep learning
CN110390336B (en) * 2019-06-05 2023-05-23 广东工业大学 Method for improving feature point matching precision
CN110378208B (en) * 2019-06-11 2021-07-13 杭州电子科技大学 Behavior identification method based on deep residual error network
CN110334321B (en) * 2019-06-24 2023-03-31 天津城建大学 City rail transit station area function identification method based on interest point data
CN110457996B (en) * 2019-06-26 2023-05-02 广东外语外贸大学南国商学院 Video moving object tampering evidence obtaining method based on VGG-11 convolutional neural network
CN110348494A (en) * 2019-06-27 2019-10-18 中南大学 A kind of human motion recognition method based on binary channels residual error neural network
CN112241673B (en) * 2019-07-19 2022-11-22 浙江商汤科技开发有限公司 Video processing method and device, electronic equipment and storage medium
CN110633630B (en) * 2019-08-05 2022-02-01 中国科学院深圳先进技术研究院 Behavior identification method and device and terminal equipment
CN110533101A (en) * 2019-08-29 2019-12-03 西安宏规电子科技有限公司 A kind of image classification method based on deep neural network subspace coding
CN110765854B (en) * 2019-09-12 2022-12-02 昆明理工大学 Video motion recognition method
CN110826522A (en) * 2019-11-15 2020-02-21 广州大学 Method and system for monitoring abnormal human behavior, storage medium and monitoring equipment
CN111079674B (en) * 2019-12-22 2022-04-26 东北师范大学 Target detection method based on global and local information fusion
CN111103275B (en) * 2019-12-24 2021-06-01 电子科技大学 PAT prior information assisted dynamic FMT reconstruction method based on CNN and adaptive EKF
CN111242044B (en) * 2020-01-15 2022-06-28 东华大学 Night unmanned vehicle scene prediction method based on ConvLSTM dual-channel coding network
CN111325149B (en) * 2020-02-20 2023-05-26 中山大学 Video action recognition method based on time sequence association model of voting
CN111325155B (en) * 2020-02-21 2022-09-23 重庆邮电大学 Video motion recognition method based on residual difference type 3D CNN and multi-mode feature fusion strategy
CN111382403A (en) * 2020-03-17 2020-07-07 同盾控股有限公司 Training method, device, equipment and storage medium of user behavior recognition model
WO2021204143A1 (en) * 2020-04-08 2021-10-14 Guangdong Oppo Mobile Telecommunications Corp., Ltd. Methods for action localization, electronic device and storage medium
CN111860432B (en) * 2020-07-30 2023-11-24 中国海洋大学 Ternary relation cooperation module and modeling method for video space-time characterization learning
CN112347963B (en) * 2020-11-16 2023-07-11 申龙电梯股份有限公司 Elevator door blocking behavior identification method
CN112541081B (en) * 2020-12-21 2022-09-16 中国人民解放军国防科技大学 Migratory rumor detection method based on field self-adaptation
CN112699786B (en) * 2020-12-29 2022-03-29 华南理工大学 Video behavior identification method and system based on space enhancement module
CN112668495B (en) * 2020-12-30 2024-02-02 东北大学 Full-time space convolution module-based violent video detection algorithm
CN112784698B (en) * 2020-12-31 2024-07-02 杭州电子科技大学 No-reference video quality evaluation method based on deep space-time information
CN112990013B (en) * 2021-03-15 2024-01-12 西安邮电大学 Time sequence behavior detection method based on dense boundary space-time network
CN113221693B (en) * 2021-04-29 2023-07-28 苏州大学 Action recognition method
CN113139530B (en) * 2021-06-21 2021-09-03 城云科技(中国)有限公司 Method and device for detecting sleep post behavior and electronic equipment thereof
CN113327299B (en) * 2021-07-07 2021-12-14 北京邮电大学 Neural network light field method based on joint sampling structure
CN114758304B (en) * 2022-06-13 2022-09-02 江苏中腾石英材料科技股份有限公司 High-purity rounded quartz powder sieving equipment and sieving control method thereof
CN117332352B (en) * 2023-10-12 2024-07-05 国网青海省电力公司海北供电公司 Lightning arrester signal defect identification method based on BAM-AlexNet


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105701507A (en) * 2016-01-13 2016-06-22 吉林大学 Image classification method based on dynamic random pooling convolution neural network
US9946933B2 (en) * 2016-08-18 2018-04-17 Xerox Corporation System and method for video classification using a hybrid unsupervised and supervised multi-layer architecture
CN107169415A (en) * 2017-04-13 2017-09-15 西安电子科技大学 Human motion recognition method based on convolutional neural networks feature coding
CN107066973A (en) * 2017-04-17 2017-08-18 杭州电子科技大学 A kind of video content description method of utilization spatio-temporal attention model
CN107506712A (en) * 2017-08-15 2017-12-22 成都考拉悠然科技有限公司 Method for distinguishing is known in a kind of human behavior based on 3D depth convolutional networks

Also Published As

Publication number Publication date
CN109446923A (en) 2019-03-08

Similar Documents

Publication Publication Date Title
CN109446923B (en) Deep supervision convolutional neural network behavior recognition method based on training feature fusion
Wang et al. Depth pooling based large-scale 3-d action recognition with convolutional neural networks
Asadi-Aghbolaghi et al. A survey on deep learning based approaches for action and gesture recognition in image sequences
CN110175580B (en) Video behavior identification method based on time sequence causal convolutional network
Zhu et al. Temporal cross-layer correlation mining for action recognition
Özyer et al. Human action recognition approaches with video datasets—A survey
CN111652903B (en) Pedestrian target tracking method based on convolution association network in automatic driving scene
Wang et al. Gan-knowledge distillation for one-stage object detection
Serpush et al. Complex human action recognition using a hierarchical feature reduction and deep learning-based method
CN108399435A (en) A kind of video classification methods based on sound feature
Balasubramanian et al. Analysis of facial emotion recognition
Bai et al. Correlative channel-aware fusion for multi-view time series classification
CN109446897B (en) Scene recognition method and device based on image context information
Ding et al. A lightweight action recognition method for unmanned-aerial-vehicle video
Xue et al. Crowd scene analysis encounters high density and scale variation
Serpush et al. Complex human action recognition in live videos using hybrid FR-DL method
Dey et al. Umpire’s Signal Recognition in Cricket Using an Attention based DC-GRU Network
CN117593794A (en) Improved YOLOv7-tiny model and human face detection method and system based on model
Mahjoub et al. A flexible high-level fusion for an accurate human action recognition system
Bux Vision-based human action recognition using machine learning techniques
Zhao et al. Research on human behavior recognition in video based on 3DCCA
Yang et al. Attentional fused temporal transformation network for video action recognition
Sudhakaran et al. Top-down attention recurrent VLAD encoding for action recognition in videos
Butt et al. Leveraging Transfer Learning for Spatio-Temporal Human Activity Recognition from Video Sequences.
Li et al. Deeply-Supervised CNN Model for Action Recognition with Trainable Feature Aggregation.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant