CN114596520A - First visual angle video action identification method and device - Google Patents

First visual angle video action identification method and device

Info

Publication number
CN114596520A
Authority
CN
China
Prior art keywords
feature
module
video
mode
mciam
Prior art date
Legal status
Pending
Application number
CN202210120923.5A
Other languages
Chinese (zh)
Inventor
聂梦真
姜金印
Current Assignee
Tianjin University
Original Assignee
Tianjin University
Priority date
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN202210120923.5A
Publication of CN114596520A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods


Abstract

The invention provides a first-person view video action recognition method and device that construct a multi-stream network model for first-person action recognition; the model comprises a convolutional neural network (CNN), a Transformer network, and related components. The model takes an RGB modality and a depth modality as input and performs action classification in three stages. Dual-scale features of each video frame are extracted by a convolutional neural network pre-trained on ImageNet; different intra-frame patch segmentation schemes are applied according to the characteristics of the different modalities and the different feature-map scales, and a correlation calculation mechanism is combined to enhance the spatial representation and enrich the spatial semantic information; a multi-scale cross-modal fusion module generates cross-modal representations through interaction and strengthens the cross-correlation between the modalities; temporal information between video frames is extracted with an attention mechanism; finally, the bimodal data enhanced by spatial interaction are fused. The bimodal spatio-temporal information is thus exploited and fused effectively, and good action recognition performance can be achieved.

Description

First visual angle video action identification method and device
Technical Field
The invention belongs to the fields of deep learning and computer vision, relates to feature extraction and action recognition technology, and particularly relates to a first-person view video action recognition method and device.
Background
Action recognition based on video data is an important research direction in computer vision. Given an unprocessed or pre-segmented video clip, the task is to classify the video according to the human actions that appear in it. It has broad application value in video security monitoring, daily behavior recognition, behavior interaction, and related fields.
In recent years, much research in computer vision has relied on deep learning. Convolutional neural networks (CNNs) have strong feature extraction capability and can extract feature information from feature maps effectively. 2D convolutional neural networks are mature on image classification tasks; when transferred to video action classification, they act directly on each frame, extract deep features through multi-layer convolutions, fuse the per-frame features in some manner, and feed them through a feed-forward network into a fully connected (FC) layer to complete classification. This approach extracts the appearance information of video frames well, and the final classification result is determined by the appearance of each frame. However, because it discards the temporal relation between frames, it performs adequately only on tasks where the appearance, rather than the motion, is discriminative, and it generalizes poorly to data for which the motion between frames must be distinguished.
Image classification in computer vision is based on two-dimensional data and emphasizes the spatial information within a frame. Video action recognition must not only process the spatial semantics of each frame but also introduce a time dimension T to represent the temporal order of video frames; the task is therefore more complex and is currently a research focus in computer vision. Compared with image classification, video classification adds temporal information. A common 3D convolutional network (for example, the combination of a two-dimensional spatial convolution and a one-dimensional temporal convolution) processes the intra-frame spatial information of a video segment and the temporal information between adjacent frames simultaneously, taking motion into account while processing the appearance of each frame, and therefore achieves higher accuracy than 2D convolutional methods. The Inflated 3D ConvNet (I3D) method, as an optimization of 3D convolutional networks, expands trained 2D convolution kernels to 3D; it can benefit from models pre-trained on the large ImageNet data set and further improves accuracy, but incurs a larger computational cost.
Spatio-temporal two-stream methods are one of the mainstream research directions in action recognition. They focus on the processing of spatial and temporal information: the appearance of each frame carries spatial information such as objects and scenes, while the motion across frames carries temporal information such as the movement of people and objects in the video and the movement of the camera. A spatio-temporal network usually takes RGB frames as one input to process the spatial stream and a stack of optical-flow frames as another input to process the temporal stream, trains the two streams jointly, and finally fuses the extracted deep features complementarily to complete classification. Although this scheme fuses spatio-temporal information well, it does not classify highly complex, long videos well and cannot model them adequately.
The long short-term memory (LSTM) method is commonly used for sequence modeling problems such as natural language processing and is well suited to sequential data. Considering the complex temporal relations between video frames, it has been transferred to video action classification, where it alleviates the vanishing- and exploding-gradient problems of long-sequence training. The LSTM controls the transmitted state through gating, models long-range temporal information, and achieves higher accuracy than a recurrent neural network (RNN) on longer sequences. However, when processing long video sequences, the LSTM has high computational complexity, is difficult to train, and models the temporal dimension inefficiently.
In recent years, the attention mechanism has become an important method in deep learning and has been studied intensively in many application fields. Attention in deep learning resembles human visual attention: it concentrates on the important parts of a large amount of information, selects the key information, and ignores the rest. The attention mechanism removes the limitation that RNN models cannot be parallelized, allows dependencies between input and output positions to be modeled regardless of their distance in the sequence, handles long-range dependencies well, and has been widely applied to third-person action recognition.
Deep-learning-based algorithms are widely applied to human activity recognition, and first-person vision tasks are attracting attention because of their many practical applications. However, owing to camera motion, the short sampling distance, and similar factors, first-person visual data exhibit relatively large inter-frame changes of the actor, the objects, and the background, so first-person video differs considerably from third-person video. At the same time, traditional video action recognition algorithms pay little attention to the correlation of spatial information within a frame and rarely consider the spatial distribution of features on the feature map; they therefore have difficulty with first-person video action recognition and generalize poorly to it.
Disclosure of Invention
The purpose of the invention is to overcome the shortcomings of existing deep learning methods and to provide a first-person view video action recognition method and device which, by combining the characteristics of first-person video data sets with a multi-scale bimodal spatio-temporal representation attention mechanism, obtain correlation weights of the video data and enhance the spatio-temporal semantic representation, while a cross-fusion module yields higher classification accuracy.
The technical solution for achieving this purpose is as follows:
The invention provides a first-person view video action recognition method. A first-person video action data set is input into a multi-scale network based on an RGB modality and a depth modality to extract spatial semantics. The network adopts a convolutional neural network CNN; two different convolutional blocks of the CNN are selected to output feature maps of two scales, where the first-scale feature maps contain a certain amount of spatial information and the second-scale feature maps contain rich high-level semantic information. The first-scale feature maps are processed by an MCIAM I module and the second-scale feature maps by an MCIAM II module to obtain feature embedding vectors with rich multi-scale bimodal spatial semantics. These feature embedding vectors are then processed by several Inter-frame Encoder modules to extract the inter-frame temporal relation, yielding three feature embedding vectors. The data of the RGB branch and the depth branch are fused by a CFAM module and combined with the feature embedding vector of the multi-scale fusion branch to generate a joint feature embedding vector. The joint feature embedding vector is processed by a linear layer to obtain a per-frame action classification result, the results of the video frames of an action segment are averaged along the temporal direction, and the final recognition result is output.
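To make the data flow concrete, the following PyTorch sketch traces tensor shapes through the three stages described above. It is only an illustration: the frame count, feature-map sizes, embedding dimension, and the stand-ins used for the MCIAM, Inter-frame Encoder, and CFAM modules (a shared linear layer, nn.TransformerEncoderLayer, and an element-wise product) are assumptions made for the sketch, not the patented implementation.

```python
import torch
import torch.nn as nn

T, D, num_classes = 8, 512, 45   # assumed frame count, embedding dim, FPHA classes

# Stage 1: per-frame two-scale CNN features for both modalities (stand-in tensors).
feat_rgb_s1 = torch.randn(T, 256, 14, 14)   # first-scale maps (more spatial detail)
feat_dep_s1 = torch.randn(T, 256, 14, 14)   # (they would feed MCIAM I; unused in this stand-in)
feat_rgb_s2 = torch.randn(T, 512, 7, 7)     # second-scale maps (high-level semantics)
feat_dep_s2 = torch.randn(T, 512, 7, 7)

# Stage 2: MCIAM I / MCIAM II would produce per-frame embeddings for three branches
# (RGB, depth, multi-scale cross-modal fusion); a shared linear layer stands in here.
embed = nn.Linear(512 * 7 * 7, D)
tok_rgb = embed(feat_rgb_s2.flatten(1))     # (T, D)
tok_dep = embed(feat_dep_s2.flatten(1))     # (T, D)
tok_fuse = tok_rgb + tok_dep                # stand-in for the fused branch F_Cross

# Stage 3: Inter-frame Encoders model the temporal relation of each branch;
# nn.TransformerEncoderLayer is used as a stand-in.
enc = nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True)
F_R2, F_D2, F_Cross = (enc(t.unsqueeze(0)) for t in (tok_rgb, tok_dep, tok_fuse))

# Stage 4: CFAM-style fusion (stand-in: element-wise product of the two modality
# branches plus the fused branch), per-frame classification, then temporal averaging.
head = nn.Linear(D, num_classes)
joint = F_R2 * F_D2 + F_Cross
logits = head(joint).mean(dim=1)            # (1, num_classes)
print(logits.shape)
```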
Furthermore, the MCIAM I module first segments the first-scale feature maps by average segmentation or by edge-crossing segmentation, then maps the feature maps into embedding vectors through feature embedding and linear mapping and adds position information, and finally computes a weight matrix between the embedding vectors generated by the same segmentation scheme in the RGB modality and the depth modality, thereby enhancing the spatial correlation between the modalities.
Furthermore, the MCIAM II module uses a horizontal-and-vertical segmentation scheme to compute the spatial correlation between the RGB modality and the depth modality, and fuses the four embedding vectors generated by the MCIAM I module according to their modalities, completing the multi-scale bimodal spatial semantic enhancement.
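One plausible reading of the horizontal-and-vertical segmentation is to cut the feature map into equal horizontal and vertical strips and flatten each strip into a token; the sketch below shows this under assumed sizes (the strip count, the feature-map resolution, and the subsequent linear mapping and cross-modal attention are assumptions, not details given in the patent).

```python
import torch

def strip_split(feature_map: torch.Tensor, num_strips: int = 4):
    """Split a (C, H, W) feature map into horizontal and vertical strips and
    flatten each strip into a token (assumed interpretation of the
    horizontal-and-vertical segmentation used by MCIAM II)."""
    C, H, W = feature_map.shape
    horizontal = feature_map.reshape(C, num_strips, H // num_strips, W)
    horizontal = horizontal.permute(1, 0, 2, 3).reshape(num_strips, -1)   # (num_strips, C*(H/num_strips)*W)
    vertical = feature_map.reshape(C, H, num_strips, W // num_strips)
    vertical = vertical.permute(2, 0, 1, 3).reshape(num_strips, -1)       # (num_strips, C*H*(W/num_strips))
    return horizontal, vertical

h, v = strip_split(torch.randn(512, 8, 8), num_strips=4)
print(h.shape, v.shape)   # torch.Size([4, 8192]) torch.Size([4, 8192])
```

Each strip token would then be mapped by a linear layer to the embedding dimension before the cross-modal correlation is computed.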
Furthermore, the Inter-frame Encoder module uses a trainable linear projection to perform the linear mapping that generates the corresponding feature embedding vector, adds a position code to the feature embedding vector to encode the position of each sequence frame or each patch, then performs a multi-head attention operation to obtain an intermediate vector, and finally feeds the intermediate vector into a feed-forward network with residual connection and layer normalization.
The method evaluates the difference between the true action label of a video and the prediction of the current model through a cross-entropy loss function:

Loss = -Σ_{i=1}^{n} y_i · log(p_i)

where n is the number of action classes, i is the index of the true class of the video data currently processed by the network, y_i is the true label of the corresponding class, and p_i is the probability predicted by the model for that class.
Moreover, data preprocessing and data augmentation are performed before the first-person video action data set is input into the network.
The convolutional neural network CNN is a ResNet-34 model pre-trained on the ImageNet data set, with the multi-layer convolutional residual blocks of ResNet-34 as its basic building elements.
In another aspect, the present invention further provides a first-person view video action recognition apparatus, comprising:
a convolutional neural network CNN module, used to extract feature information, performing feature extraction for the RGB modality and the depth modality at the 2D feature-map level and outputting feature maps of two scales, (N1, N1) and (N2, N2);
a multi-modal cross-frame attention module MCIAM I, used to process the (N1, N1)-scale feature maps: it applies two different segmentation schemes, average segmentation and edge-crossing segmentation, maps the feature maps into embedding vectors through feature embedding and linear mapping, adds position information, computes a weight matrix between the embedding vectors generated by the same segmentation scheme in the two modalities, and enhances the spatial correlation between the modalities;
a multi-modal cross-frame attention module MCIAM II, used to process the (N2, N2)-scale feature maps: it computes the spatial correlation between the modalities with a horizontal-and-vertical segmentation scheme, following the same calculation procedure as the MCIAM I module, and fuses the four embedding vectors generated by the MCIAM I module according to their modalities to complete the multi-scale bimodal spatial semantic enhancement;
an Inter-frame Encoder, used to model the temporal process: it processes the action timing information with a self-attention mechanism, models the long-term relation of the motion, assigns reasonable weights to the inter-frame feature embeddings, suppresses the interference of irrelevant people and objects in the video, and allocates more attention resources to the focus area;
and a cross-fusion attention module CFAM, used to complete the joint representation of the RGB modality and the depth modality of the temporal network: it learns the structure shared between the different modalities through a mutual attention mechanism and is responsible for the final multi-scale spatio-temporal information fusion, generating the joint feature embedding vector.
Furthermore, the apparatus also comprises a preprocessing module used to preprocess and augment the first-person video action data by random cropping.
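As a minimal illustration of such a preprocessing module, the torchvision pipeline below performs resizing, random cropping, and normalization; the resolution and the normalization statistics are common defaults assumed for the sketch, since the patent only specifies random cropping.

```python
import torchvision.transforms as T

# Hypothetical preprocessing pipeline: only the random crop is taken from the
# patent; the 224x224 resolution and ImageNet statistics are assumptions.
preprocess = T.Compose([
    T.Resize(256),
    T.RandomCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
```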
A third aspect of the present invention provides a storage medium having a computer program stored therein, wherein the program is executable by a terminal device or a computer to perform the method of the first aspect of the present invention.
A network built on inter-frame temporal modeling can perform third-person action recognition well. Because of the way it is captured, however, first-person video contains irregular jitter and background motion, and most first-person actions are interactive (for example, interactions between people and objects), so finer-grained recognition is required: not only the action of the moving subject but also the object being interacted with must be recognized. In the invention, different segmentation and interaction schemes are applied to the feature maps of different scales before the temporal learning module, which strengthens the representation of the object manipulated by the moving subject and improves the network's ability to learn spatial information.
By analyzing the complementary relation between the modalities, the invention proposes a grouping-and-fusion scheme for feature embedding vectors, analyzed on three levels. On the spatial level, latent spatial semantic relations of different scales between the modalities are mined. On the fusion-stage level, two fusion architectures operating at different stages are adopted: the first acquires multi-scale spatial semantic information after inter-modal fusion and processes the inter-frame temporal relation afterwards, completing the multi-scale fusion of spatio-temporal information, while the second first processes the temporal information of the RGB modality and the depth modality separately and then fuses them; the two architectures attend to different information and enrich the fused semantics. On the multi-stream architecture level, complementary information is combined by cross-fusion, which clearly improves action recognition accuracy.
Extensive experiments on the FPHA benchmark demonstrate the effectiveness of each module; more importantly, the invention provides a new first-person action recognition method that combines multi-scale, multi-modal spatio-temporal information fusion.
The advantages and beneficial effects of the invention are as follows:
1. according to the invention, the multi-mode cross-frame attention modules MCIAM I and MCIAM II are utilized to realize a multi-scale multi-mode inter-space interactive attention mechanism based on Patch, information of different modes can be fused and enhanced, the optimal representation effect is learned, and abundant semantic information is provided for subsequent time sequence information processing work. Meanwhile, by combining the characteristics of the first visual angle video action data set and adopting various Patch segmentation strategies, the edge information of the central area of the cutting position is fully utilized, the generalization capability of the model can be enhanced, and good effects can be obtained in experiments of different scenes and motions.
2. The invention designs a bimodal grouping-and-fusion strategy for feature embedding vectors that preserves the specific attributes of each modality, performs multi-scale fusion of the low-level features rich in spatial information and the high-level features rich in deep semantics through embedding-mapping conversions, and, by combining the respective characteristics of the RGB and depth data, explores the information that can be shared within a unified deep architecture.
Drawings
FIG. 1a is an example of bimodal data (clapping action, RGB modality) employed by the present invention;
FIG. 1b is an example of bimodal data (clapping action, depth modality) employed by the present invention;
FIG. 2 is a network overall structure diagram of a first-view video motion recognition method according to an embodiment of the present invention;
FIG. 3 is a diagram of a model of the MCIAM I module;
FIG. 4 is a model diagram of the MCIAM II module;
FIG. 5 is a model diagram of the Inter-frame Encoder module;
FIG. 6 is a model diagram of a CFAM module;
Detailed Description
The present invention is further described in the following examples, which are illustrative rather than limiting and do not restrict the scope of the invention.
The invention provides a first-person view video action recognition method that is trained as an end-to-end network.
As shown in fig. 1a and fig. 1b, the invention uses bimodal first-person video action data for its experiments.
As shown in fig. 2, the invention adopts a network structure based on an RGB modality and a depth modality; the multi-stream network is trained jointly, and the data of the two modalities correspond one-to-one in time to the content of the video frames. The FPHA data set is selected for the experiments; it contains 1175 videos covering 45 action categories and involves 24 different objects.
As shown in fig. 1a and 1b, most video clips in the first-person action recognition task are short, and the range of motion of the action is small while it occupies a large part of the visual field, which introduces a certain amount of context redundancy.
The invention constructs a feature extraction network that takes the multi-layer convolutional residual blocks of ResNet-34 as basic building elements and uses a ResNet-34 model pre-trained on the ImageNet data set. As the depth of a convolutional neural network increases, the spatial semantics of the extracted features gradually decrease while the high-level semantics gradually increase, and both kinds of semantic information matter for first-person action recognition. To exploit the pre-trained feature extractor, to ensure that the high-level semantics still retain some spatial information of the feature map, and to improve intra-frame correlation, the fourth and fifth convolutional blocks of the ResNet-34 network are selected to output feature maps of two scales: the first-scale feature maps contain a certain amount of spatial information and the second-scale feature maps contain rich high-level semantic information. A multi-stream, multi-scale network structure is adopted to extract rich spatial semantics, and the MCIAM I and MCIAM II modules are designed for this purpose.
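A minimal sketch of this two-scale extractor is shown below, assuming the fourth and fifth convolutional blocks correspond to layer3 and layer4 in torchvision's ResNet-34 naming; the depth branch is assumed to reuse the same kind of backbone with depth frames replicated to three channels.

```python
import torch
import torchvision

# ImageNet-pretrained ResNet-34 backbone; layer3 / layer4 are tapped for the
# two feature-map scales (an assumed mapping of "fourth and fifth blocks").
backbone = torchvision.models.resnet34(weights="IMAGENET1K_V1")

def two_scale_features(frames: torch.Tensor):
    """frames: (BT, 3, 224, 224) -> (BT, 256, 14, 14) and (BT, 512, 7, 7)."""
    x = backbone.conv1(frames)
    x = backbone.bn1(x)
    x = backbone.relu(x)
    x = backbone.maxpool(x)
    x = backbone.layer1(x)
    x = backbone.layer2(x)
    f1 = backbone.layer3(x)    # first-scale map: more spatial information
    f2 = backbone.layer4(f1)   # second-scale map: richer high-level semantics
    return f1, f2

f1, f2 = two_scale_features(torch.randn(4, 3, 224, 224))
print(f1.shape, f2.shape)      # (4, 256, 14, 14) (4, 512, 7, 7)
```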
Through the processing of the MCIAM I and MCIAM II modules, the network obtains feature embedding vectors with rich multi-scale bimodal spatial semantics, which serve as the input of the Inter-frame Encoder modules.
The extraction of the inter-frame temporal relation is completed by several Inter-frame Encoder modules, which yield three feature embedding vectors generated by the RGB branch, the depth branch, and the multi-scale fusion branch respectively. The data of the RGB branch and the depth branch are fused by the CFAM module and combined with the feature embedding vector of the multi-scale fusion branch to generate the final feature embedding vector.
As shown in fig. 3 and 4, the invention introduces a specific implementation of the multi-modal cross intra-frame attention modules (MCIAM I, MCIAM II). Both modules employ a similar underlying attention mechanism; the RGB-modality branch of the framework is taken as an example below.
For an input feature map F ∈ BT×C×H×W (BT: batch size × number of frames, C: channels, H: height, W: width), the input feature map of the MCIAM I module is f_R1 ∈ 4×256×N1×N1 and the input feature map of the MCIAM II module is f_R2 ∈ 128×512×N2×N2. Two segmentation schemes are applied to the feature map f_R1. The first splits the feature map along its central axes into 4 intra-frame patches and extracts the correlation between the patches. Specifically, f_R1 is reshaped into f_R1 ∈ BT×N×CP², where P is the width and height of each newly generated patch and

N = HW / P²

is the number of generated patches. The CP²-dimensional vectors are mapped to dimension D by a linear layer to obtain feature embedding vectors, which are input into a cross-fusion attention module (identical in structure to the CFAM module) to generate the feature embedding vector F_R1' ∈ 128×4×512. The second segmentation scheme makes full use of the edge information at the cutting positions through a flattening operation, so that the learned features contain richer spatial information around the patch edges; the result is likewise input into a cross-fusion attention module to generate the feature embedding vector F_R1'' ∈ 128×4×512.
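The first (average) segmentation can be sketched as follows: the map is cut along its central axes into four patches, each flattened to a C·P² vector and linearly mapped to D. The shapes, and the omission of the position code and of the subsequent cross-fusion attention, are simplifying assumptions of this sketch.

```python
import torch
import torch.nn as nn

def quadrant_patches(f: torch.Tensor) -> torch.Tensor:
    """Split a (BT, C, H, W) map along its central axes into 4 intra-frame
    patches and flatten each to a C*P*P vector, giving (BT, 4, C*P*P)."""
    BT, C, H, W = f.shape
    P = H // 2
    patches = f.unfold(2, P, P).unfold(3, P, P)    # (BT, C, 2, 2, P, P)
    patches = patches.permute(0, 2, 3, 1, 4, 5)    # (BT, 2, 2, C, P, P)
    return patches.reshape(BT, 4, C * P * P)

D = 512
f_R1 = torch.randn(128, 256, 14, 14)               # assumed BT=128, N1=14
tokens = quadrant_patches(f_R1)                    # (128, 4, 256*7*7)
to_embed = nn.Linear(tokens.shape[-1], D)          # maps CP^2 to D
F_R1_prime = to_embed(tokens)                      # (128, 4, 512), cf. F_R1'
print(F_R1_prime.shape)
```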
The input feature map of the MCIAM II module, f_R2 ∈ 4×512×N2×N2, is input into the cross-fusion attention module to generate the feature embedding vector F_R2'' ∈ 128×4×512. The vectors F_R1' and F_R1'' generated by the MCIAM I module through the two segmentation-interaction schemes are added to F_R2'', and the result is added to F_D2'' generated in parallel by the depth-modality branch, producing a feature embedding vector F_Cross that fuses the spatial correlation of the two modalities at the two scales.
FIG. 5 shows the Inter-frame Encoder module used by the invention to extract the inter-frame temporal relation. Taking the feature map f_R2 input to this module as an example, a trainable linear projection performs the linear mapping that generates the corresponding feature embedding vector f_R2', and a position code is added to encode the position of each sequence frame or each patch. A multi-head attention operation is then performed: the embedding vector is first mapped by linear projections to query, key, and value vectors of dimension D; the dot product of the query and key vectors is passed through softmax to obtain the attention weights, which are used for a weighted sum of the value vectors. The operation inside each head can be written as

Attention(Q, K, V) = softmax(QK^T / √D) · V

where Q is the query vector, K the key vector, and V the value vector. The outputs of the heads are concatenated, followed by dropout, a residual connection, and layer normalization:

f_R2'' = LN(Dropout(Concat(MSA)) + f_R2' + x_pos)

The intermediate vector f_R2'' is then fed into a feed-forward network, again followed by a residual connection and layer normalization:

F_R2 = LN(Dropout(FFN(f_R2'')) + f_R2'')

where FFN is a feed-forward network consisting of two convolution layers with kernel size 1, and F_R2 is the final output of the Inter-frame Encoder module.
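A sketch of one such block, following the formulas above, is given below; the number of heads, the dropout rate, the FFN expansion factor, and the use of a learned positional parameter are assumptions.

```python
import torch
import torch.nn as nn

class InterFrameEncoder(nn.Module):
    """Sketch of one Inter-frame Encoder block: learned positional codes,
    multi-head self-attention with dropout, residual connection and layer
    normalization, then a two-layer kernel-size-1 convolutional FFN."""
    def __init__(self, dim: int = 512, heads: int = 8, seq_len: int = 8, p: float = 0.1):
        super().__init__()
        self.pos = nn.Parameter(torch.zeros(1, seq_len, dim))
        self.msa = nn.MultiheadAttention(dim, heads, dropout=p, batch_first=True)
        self.drop = nn.Dropout(p)
        self.ln1 = nn.LayerNorm(dim)
        self.ln2 = nn.LayerNorm(dim)
        # FFN as two 1x1 convolutions applied along the temporal axis.
        self.ffn = nn.Sequential(
            nn.Conv1d(dim, dim * 4, kernel_size=1), nn.GELU(),
            nn.Conv1d(dim * 4, dim, kernel_size=1),
        )

    def forward(self, f: torch.Tensor) -> torch.Tensor:     # f: (B, T, dim)
        x = f + self.pos                                     # f_R2' + x_pos
        attn, _ = self.msa(x, x, x)                          # Concat(MSA)
        x = self.ln1(self.drop(attn) + x)                    # f_R2'' in the text
        y = self.ffn(x.transpose(1, 2)).transpose(1, 2)      # FFN(f_R2'')
        return self.ln2(self.drop(y) + x)                    # F_R2

F_R2 = InterFrameEncoder()(torch.randn(16, 8, 512))
print(F_R2.shape)   # torch.Size([16, 8, 512])
```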
As shown in fig. 6, the cross-fusion attention module CFAM is responsible for the cross-fusion in the invention. The different modalities have diverse characteristics, and the CFAM module performs the fusion operation to complete the joint representation of the two branches. The feature embedding vector F_R2 of the RGB branch and the feature embedding vector F_D2 of the depth branch are, after new position-information processing, handled essentially as in the Inter-frame Encoder module of fig. 5, except that the MSA is computed differently, which can be written as

F_D2_M, F_R2_M = CFAM(F_R2, F_D2)

The vectors F_D2_M and F_R2_M generated by the CFAM module are multiplied to produce the joint representation F_M. F_Cross, after its temporal correlation has been extracted, yields a new feature embedding vector F_Cross', which is added to F_M and passed through a linear layer to complete the class mapping; the results of all frames are then averaged to obtain the final recognition result.
The above network model employs a cross-entropy loss function, which evaluates the difference between the true action label of a video and the prediction of the current model; a smaller loss value indicates a better-trained model. The loss function is

Loss = -Σ_{i=1}^{n} y_i · log(p_i)

where n is the number of action classes, i is the index of the true class of the video data currently processed by the network, y_i is the true label of the corresponding class, and p_i is the probability predicted by the model for that class. Back-propagating the loss over repeated iterations optimizes the network parameters and improves model performance.
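For reference, this loss can be computed as follows; the batch size and the 45 FPHA classes are assumptions, and the explicit sum is shown alongside the equivalent library call.

```python
import torch
import torch.nn.functional as F

logits = torch.randn(16, 45)                  # model outputs before softmax
labels = torch.randint(0, 45, (16,))          # true class indices

# 1) Library form: softmax + negative log-likelihood in one call.
loss = F.cross_entropy(logits, labels)

# 2) Explicit form of -sum_i y_i * log(p_i) with one-hot y.
p = logits.softmax(dim=-1)
y = F.one_hot(labels, num_classes=45).float()
loss_explicit = -(y * p.clamp_min(1e-12).log()).sum(dim=-1).mean()

print(loss.item(), loss_explicit.item())      # the two values agree
```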
The method is trained and tested on the FPHA data set with the PyTorch deep learning framework; the evaluation metric is recognition accuracy. The experimental results are shown in the following table:
Methods Modality Accuracy(%)
HON4D Depth 70.61
Two stream-color RGB 61.56
H+O RGB 82.43
1-layer LSTM Pose 78.73
2-layer LSTM Pose 80.14
Gram Matrix Pose 85.39
Two stream Flow+RGB 75.3
HOG2-depth+pose Pose+Depth 66.78
JOULE-all Pose+RGB+Depth 78.78
Ours RGB+Depth 96.70
The table shows, through comparison with other mainstream algorithms, the effectiveness of the invention: the multi-scale bimodal spatio-temporal representation attention mechanism assigns reasonable correlation weights to the feature embeddings and enhances the spatio-temporal semantic representation, and, combined with the cross-fusion module, a higher recognition accuracy is obtained; the invention therefore also has practical value.
The foregoing is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, various changes and modifications can be made without departing from the inventive concept, and these changes and modifications are all within the scope of the present invention.

Claims (10)

1. A first-person view video action recognition method, characterized in that a first-person video action data set is input into a multi-scale network based on an RGB modality and a depth modality to extract spatial semantics; the network adopts a convolutional neural network CNN, and two different convolutional blocks of the CNN are selected to output feature maps of two scales, the first-scale feature maps containing a certain amount of spatial information and the second-scale feature maps containing rich high-level semantic information; the first-scale feature maps are processed by an MCIAM I module and the second-scale feature maps by an MCIAM II module to obtain feature embedding vectors with rich multi-scale bimodal spatial semantics; the feature embedding vectors are processed by several Inter-frame Encoder modules to extract the inter-frame temporal relation, yielding three feature embedding vectors; the data of the RGB branch and the depth branch are fused by a CFAM module and combined with the feature embedding vector of the multi-scale fusion branch to generate a joint feature embedding vector; the joint feature embedding vector is processed by a linear layer to obtain a per-frame action classification result, the video frames of an action segment are then averaged along the temporal direction, and the final recognition result is output.
2. The first-person view video action recognition method of claim 1, wherein the MCIAM I module segments the first-scale feature maps by average segmentation or edge-crossing segmentation, maps the feature maps into embedding vectors through feature embedding and linear mapping, adds position information, computes a weight matrix between the embedding vectors generated by the same segmentation scheme in the RGB modality and the depth modality, and enhances the spatial correlation between the modalities.
3. The first-person view video action recognition method of claim 1, wherein the MCIAM II module computes the spatial correlation between the RGB modality and the depth modality using a horizontal-and-vertical segmentation scheme, and fuses the four embedding vectors generated by the MCIAM I module according to their modalities to complete the multi-scale bimodal spatial semantic enhancement.
4. The first-person view video action recognition method of claim 1, wherein the Inter-frame Encoder module uses a trainable linear projection to perform the linear mapping that generates the corresponding feature embedding vector, adds a position code to the feature embedding vector to encode the position of each sequence frame or each patch, then performs a multi-head attention operation to obtain an intermediate vector, and finally feeds the intermediate vector into a feed-forward network with residual connection and layer normalization.
5. The first-person view video action recognition method of claim 1, wherein the difference between the true action label of a video and the prediction of the current model is evaluated by a cross-entropy loss function:

Loss = -Σ_{i=1}^{n} y_i · log(p_i)

where n is the number of action classes, i is the index of the true class of the video data currently processed by the network, y_i is the true label of the corresponding class, and p_i is the probability predicted by the model for that class.
6. The first-person view video action recognition method of claim 1, wherein data preprocessing and data augmentation are performed before the first-person video action data set is input into the network.
7. The first-person view video action recognition method of claim 1, wherein the convolutional neural network CNN is a ResNet-34 model pre-trained on the ImageNet data set, with the multi-layer convolutional residual blocks of ResNet-34 as basic building elements.
8. A first-person view video action recognition apparatus, comprising:
a convolutional neural network CNN module, used to extract feature information, performing feature extraction for the RGB modality and the depth modality at the 2D feature-map level and outputting feature maps of two scales, (N1, N1) and (N2, N2);
a multi-modal cross-frame attention module MCIAM I, used to process the (N1, N1)-scale feature maps: it applies two different segmentation schemes, average segmentation and edge-crossing segmentation, maps the feature maps into embedding vectors through feature embedding and linear mapping, adds position information, computes a weight matrix between the embedding vectors generated by the same segmentation scheme in the two modalities, and enhances the spatial correlation between the modalities;
a multi-modal cross-frame attention module MCIAM II, used to process the (N2, N2)-scale feature maps: it computes the spatial correlation between the modalities with a horizontal-and-vertical segmentation scheme, following the same calculation procedure as the MCIAM I module, and fuses the four embedding vectors generated by the MCIAM I module according to their modalities to complete the multi-scale bimodal spatial semantic enhancement;
an Inter-frame Encoder, used to model the temporal process: it processes the action timing information with a self-attention mechanism, models the long-term relation of the motion, assigns reasonable weights to the inter-frame feature embeddings, suppresses the interference of irrelevant people and objects in the video, and allocates more attention resources to the focus area;
and a cross-fusion attention module CFAM, used to complete the joint representation of the RGB modality and the depth modality of the temporal network: it learns the structure shared between the different modalities through a mutual attention mechanism and is responsible for the final multi-scale spatio-temporal information fusion, generating the joint feature embedding vector.
9. The apparatus of claim 8, further comprising a preprocessing module used to preprocess and augment the first-person video action data by random cropping.
10. A storage medium, in which a computer program is stored, wherein the program is executable by a terminal device or a computer to perform the method of any one of claims 1 to 7.
CN202210120923.5A 2022-02-09 2022-02-09 First visual angle video action identification method and device Pending CN114596520A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210120923.5A CN114596520A (en) 2022-02-09 2022-02-09 First visual angle video action identification method and device

Publications (1)

Publication Number Publication Date
CN114596520A true CN114596520A (en) 2022-06-07

Family

ID=81805156

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210120923.5A Pending CN114596520A (en) 2022-02-09 2022-02-09 First visual angle video action identification method and device

Country Status (1)

Country Link
CN (1) CN114596520A (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115100740A (en) * 2022-06-15 2022-09-23 东莞理工学院 Human body action recognition and intention understanding method, terminal device and storage medium
CN115100740B (en) * 2022-06-15 2024-04-05 东莞理工学院 Human motion recognition and intention understanding method, terminal equipment and storage medium
CN115082840B (en) * 2022-08-16 2022-11-15 之江实验室 Action video classification method and device based on data combination and channel correlation
CN115082840A (en) * 2022-08-16 2022-09-20 之江实验室 Action video classification method and device based on data combination and channel correlation
CN115381467A (en) * 2022-10-31 2022-11-25 浙江浙大西投脑机智能科技有限公司 Attention mechanism-based time-frequency information dynamic fusion decoding method and device
CN115797633A (en) * 2022-12-02 2023-03-14 中国科学院空间应用工程与技术中心 Remote sensing image segmentation method, system, storage medium and electronic equipment
CN115952407B (en) * 2023-01-04 2024-01-30 广东工业大学 Multipath signal identification method considering satellite time sequence and airspace interactivity
CN115952407A (en) * 2023-01-04 2023-04-11 广东工业大学 Multipath signal identification method considering satellite time sequence and space domain interactivity
CN115984293A (en) * 2023-02-09 2023-04-18 中国科学院空天信息创新研究院 Spatial target segmentation network and method based on edge perception attention mechanism
CN115984293B (en) * 2023-02-09 2023-11-07 中国科学院空天信息创新研究院 Spatial target segmentation network and method based on edge perception attention mechanism
CN116434343B (en) * 2023-04-25 2023-09-19 天津大学 Video motion recognition method based on high-low frequency double branches
CN116434343A (en) * 2023-04-25 2023-07-14 天津大学 Video motion recognition method based on high-low frequency double branches
CN116246075A (en) * 2023-05-12 2023-06-09 武汉纺织大学 Video semantic segmentation method combining dynamic information and static information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination