CN116895038B - Video motion recognition method and device, electronic equipment and readable storage medium - Google Patents
Video motion recognition method and device, electronic equipment and readable storage medium
- Publication number
- CN116895038B CN202311162287.3A
- Authority
- CN
- China
- Prior art keywords
- feature
- features
- video
- frames
- feature extraction
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
Abstract
The application discloses a video action recognition method and device, electronic equipment and a readable storage medium, belonging to the technical field of data processing. The method comprises the following steps: extracting a plurality of first frames from a target video sequence, and extracting a second frame from the plurality of first frames; inputting the plurality of first frames into a time sequence feature extraction module (TPEM) for feature extraction to obtain time sequence features; inputting the second frame into a spatial feature extraction module (SPEM) for feature extraction to obtain spatial features; fusing the time sequence features and the spatial features to obtain fusion features; and determining the video action according to the fusion features. The TPEM contains a resnet network structure and a transformer network structure, and the SPEM contains a resnet network structure. A temporal-spatial dual-branch structure is adopted to extract the spatial information and the temporal information separately and then fuse them, so that loss of related information is avoided; the resnet network structure fuses the features of the video frames at multiple scales, and the attention mechanism in the transformer network structure widens the receptive field, so that the video action recognition is more accurate.
Description
Technical Field
The application belongs to the technical field of data processing, and particularly relates to a video action recognition method, a video action recognition device, electronic equipment and a readable storage medium.
Background
The goal of video action recognition is to recognize the actions occurring in a video. A video can be regarded as a data structure formed by arranging a group of image frames in time order, so action recognition requires both analyzing the content of each image in the video and mining clues from the time sequence information between video frames.
An important dimension of information for action recognition is the time sequence itself. Without the time sequence, only a single image frame is seen, which easily leads to "action ambiguity". For example, for a person bent at the waist, a single frame cannot tell whether the person is sitting down or standing up, so the judgment must rely on the actions performed in the past few frames.
Existing video action recognition mainly uses the following methods, each of which has obvious drawbacks:
1. Each frame is processed with a 2D convolutional neural network (Convolutional Neural Networks, CNN) and the per-frame results are fused; this ignores a complete representation of the time sequence.
2. Modeling is performed with 3D CNNs, which involves a huge amount of computation.
3. Representations such as optical flow are used to compensate for the insufficient expression of actions in the time sequence; however, features such as optical flow are difficult to obtain, consume considerable resources, and have low applicability.
Disclosure of Invention
The embodiment of the application provides a video action recognition method, a video action recognition device, electronic equipment and a readable storage medium, which can address the current lack of an efficient and accurate video action recognition method.
In a first aspect, a video action recognition method is provided, including:
extracting a plurality of first frames from the target video sequence, and extracting a second frame from the plurality of first frames;
inputting the plurality of first frames into a time sequence feature extraction module TPEM for feature extraction to obtain time sequence features;
inputting the second frame into a spatial feature extraction module SPEM for feature extraction to obtain spatial features;
fusing the time sequence features and the space features to obtain fusion features;
determining a video action according to the fusion characteristics;
the TPEM comprises a neural network model with a resnet network structure and a neural network model with a transformer network structure, and the SPEM comprises the neural network model with the resnet network structure.
Optionally, the TPEM includes a first convolutional neural network CNN model having a resnet network structure and a second CNN model having a transformer network structure;
the step of inputting the plurality of first frames into the TPEM to perform feature extraction to obtain time sequence features includes:
inputting the plurality of first frames into the first CNN model for feature extraction to obtain first feature data with a plurality of feature codes;
adding category codes to the first characteristic data through first coding processing to obtain second characteristic data;
adding position coding to the second characteristic data through second coding processing to obtain third characteristic data;
inputting the third characteristic data into the second CNN model for characteristic extraction to obtain the time sequence characteristic;
wherein the category code is associated with a category of the video action and the category code is randomly initialized, the position code being associated with a temporal position of each of the first frames in the target video sequence.
Optionally, the adding position coding to the second feature data through a second coding process includes:
the position code is calculated by the following formula:
pos_t(i) = sin(t / 10000^(2k/d)),  if i = 2k;
pos_t(i) = cos(t / 10000^(2k/d)),  if i = 2k+1;
wherein t is the actual temporal position of the feature code in the video sequence, pos_t is the position vector encoded for the t-th feature of the plurality of feature encodings, pos_t(i) is the value of the i-th element in the position vector, d is the dimension of the feature code, i = 2k indicates that the i-th element is an even-numbered element, and i = 2k+1 indicates that the i-th element is an odd-numbered element.
Optionally, the SPEM includes a third CNN model having a resnet network structure;
inputting the second frame into a spatial feature extraction module SPEM for feature extraction to obtain spatial features, wherein the method comprises the following steps:
and inputting the second frame into the third CNN model to perform feature extraction to obtain spatial features.
Optionally, the fusing the time sequence feature and the space feature to obtain a fused feature includes:
and performing channel splicing on the time sequence features and the space features to obtain the fusion features.
Optionally, the determining a video action according to the fusion feature includes:
determining the video action according to the fusion characteristics and a preset corresponding relation;
the preset corresponding relation is the corresponding relation between the fusion characteristic and the video action.
In a second aspect, there is provided a video motion recognition apparatus, comprising:
an extraction module for extracting a plurality of first frames from the target video sequence and extracting a second frame from the plurality of first frames;
the first feature extraction module is used for inputting the plurality of first frames into the TPEM to perform feature extraction to obtain time sequence features;
the second feature extraction module is used for inputting the second frame into the SPEM to perform feature extraction to obtain spatial features;
the fusion module is used for fusing the time sequence features and the space features to obtain fusion features;
the determining module is used for determining video actions according to the fusion characteristics;
the TPEM comprises a neural network model with a resnet network structure and a neural network model with a transformer network structure, and the SPEM comprises the neural network model with the resnet network structure.
Optionally, the TPEM includes a first convolutional neural network CNN model having a resnet network structure and a second CNN model having a transformer network structure;
the first feature extraction module is specifically configured to:
inputting the plurality of first frames into the first CNN model for feature extraction to obtain first feature data with a plurality of feature codes;
adding category codes to the first characteristic data through first coding processing to obtain second characteristic data;
adding position coding to the second characteristic data through second coding processing to obtain third characteristic data;
inputting the third characteristic data into the second CNN model for characteristic extraction to obtain the time sequence characteristic;
wherein the category code is associated with a category of the video action and the category code is randomly initialized, the position code being associated with a temporal position of each of the first frames in the target video sequence.
Optionally, the first feature extraction module is specifically configured to:
the position code is calculated by the following formula:
pos_t(i) = sin(t / 10000^(2k/d)),  if i = 2k;
pos_t(i) = cos(t / 10000^(2k/d)),  if i = 2k+1;
wherein t is the actual temporal position of the feature code in the video sequence, pos_t is the position vector encoded for the t-th feature of the plurality of feature encodings, pos_t(i) is the value of the i-th element in the position vector, d is the dimension of the feature code, i = 2k indicates that the i-th element is an even-numbered element, and i = 2k+1 indicates that the i-th element is an odd-numbered element.
Optionally, the SPEM includes a third CNN model having a resnet network structure;
the second feature extraction module is specifically configured to:
and inputting the second frame into the third CNN model to perform feature extraction to obtain spatial features.
Optionally, the fusion module is specifically configured to:
and performing channel splicing on the time sequence features and the space features to obtain the fusion features.
Optionally, the determining module is specifically configured to:
determining the video action according to the fusion characteristics and a preset corresponding relation;
the preset corresponding relation is the corresponding relation between the fusion characteristic and the video action.
In a third aspect, there is provided an electronic device comprising a processor and a memory storing a program or instructions executable on the processor, which when executed by the processor, implement the steps of the method as described in the first aspect.
In a fourth aspect, there is provided a readable storage medium having stored thereon a program or instructions which when executed by a processor perform the steps of the method according to the first aspect.
In a fifth aspect, there is provided a chip comprising a processor and a communication interface coupled to the processor, the processor being configured to run a program or instructions to implement the method of the first aspect.
In a sixth aspect, there is provided a computer program/program product stored in a storage medium, the program/program product being executed by at least one processor to implement the method according to the first aspect.
In the embodiment of the application, a plurality of first frames are extracted from a target video sequence, a second frame is extracted from the plurality of first frames, time sequence feature extraction is performed on the plurality of first frames, spatial feature extraction is performed on the second frame, the extracted time sequence features and spatial features are fused, and finally the video action is determined according to the fusion features, wherein the TPEM comprises a neural network model with a resnet network structure and a neural network model with a transformer network structure, and the SPEM comprises a neural network model with a resnet network structure. The embodiment of the application adopts a temporal-spatial dual-branch structure that extracts the spatial information and the temporal information separately and then fuses them, so that loss of related information is avoided; the resnet network structure fuses the features of the video frames at multiple scales and takes into account both low-level high-resolution information and high-level strong-semantic information, making video action recognition more efficient, while the attention mechanism in the transformer network structure widens the receptive field, which can improve the performance of video action recognition to a certain extent and makes it more accurate.
Drawings
Fig. 1 is a flow chart of a video motion recognition method according to an embodiment of the present application;
fig. 2 is a schematic diagram of a module architecture to which the video motion recognition method provided in the embodiment of the present application is applied;
fig. 3 is a schematic structural diagram of a video motion recognition device according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Technical solutions in the embodiments of the present application will be clearly described below with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application are within the scope of the protection of the present application.
The terms "first," "second," and the like in this application are used for distinguishing between similar objects and not for describing a particular sequential or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances such that the embodiments of the application are capable of operation in sequences other than those illustrated or otherwise described herein, and that the terms "first" and "second" are generally intended to be used in a generic sense and not to limit the number of objects, for example, the first object may be one or more. Furthermore, "and/or" in this application means at least one of the connected objects. For example, "a or B" encompasses three schemes, scheme one: including a and excluding B; scheme II: including B and excluding a; scheme III: both a and B. The character "/" generally indicates that the context-dependent object is an "or" relationship.
The video motion recognition method provided by the embodiment of the application is described in detail below by some embodiments and application scenes thereof with reference to the accompanying drawings.
Referring to fig. 1, an embodiment of the present application provides a video action recognition method, including:
step 101: a plurality of first frames are extracted from the target video sequence and a second frame is extracted from the plurality of first frames.
Step 102: and inputting the plurality of first frames into the TPEM to perform feature extraction to obtain time sequence features.
Step 103: and inputting the second frame into the SPEM to perform feature extraction to obtain spatial features.
Step 104: and fusing the time sequence features and the space features to obtain fused features.
Step 105: and determining the video action according to the fusion characteristics.
The time sequence feature extraction module (Temporal Embedding, TPEM) comprises a neural network model with a residual (resnet) network structure and a neural network model with a transformer network structure, and the spatial feature extraction module (Spatial Embedding, SPEM) comprises a neural network model with a resnet network structure.
It should be noted that, in the frame extraction process of step 101, a plurality of first frames are extracted for time sequence feature extraction. Considering that the contents of adjacent frames are very similar, an interval extraction method is adopted in order to improve recognition accuracy; specifically, the first frames may be extracted every other frame, and the first frames may also be called key frames. Correspondingly, the second frame is extracted from the plurality of extracted first frames for spatial feature extraction, and the intermediate frame of the plurality of first frames may generally be used as the second frame. The specific selection of the first frames and the second frame is not limited; for example, the first frames may be extracted at intervals of 2 or 3 frames, and the second frame may be selected from the first half or the second half of the first frames, which can be flexibly set according to actual requirements.
In the embodiment of the application, a plurality of first frames are extracted from a target video sequence, a second frame is extracted from the plurality of first frames, time sequence feature extraction is performed on the plurality of first frames, spatial feature extraction is performed on the second frame, the extracted time sequence features and spatial features are fused, and finally the video action is determined according to the fusion features, wherein the TPEM comprises a neural network model with a resnet network structure and a neural network model with a transformer network structure, and the SPEM comprises a neural network model with a resnet network structure. The embodiment of the application adopts a temporal-spatial dual-branch structure that extracts the spatial information and the temporal information separately and then fuses them, so that loss of related information is avoided; the resnet network structure fuses the features of the video frames at multiple scales and takes into account both low-level high-resolution information and high-level strong-semantic information, making video action recognition more efficient, while the attention mechanism in the transformer network structure widens the receptive field, which can improve the performance of video action recognition to a certain extent and makes it more accurate.
Optionally, the TPEM includes a first convolutional neural network CNN model having a resnet network structure and a second CNN model having a transformer network structure.
Inputting a plurality of first frames into a TPEM for feature extraction to obtain time sequence features, wherein the method comprises the following steps:
(1) And inputting the plurality of first frames into a first CNN model for feature extraction to obtain first feature data with a plurality of feature codes.
The plurality of first frames are input into a CNN model with a resnet network structure; the residual structure of the resnet network can effectively alleviate network degradation and combines low-level high-resolution information with high-level strong-semantic information. After feature extraction by the first CNN model, first feature data, which may also be referred to as a feature map, is obtained.
(2) And adding category codes to the first characteristic data through the first coding process to obtain second characteristic data.
The feature map output by the first CNN model is encoded, and a class code (class token) is added, i.e., one data dimension is added. The class code is associated with the category of the video action and is used for classifying the video action. Because the class code is randomly initialized rather than derived from image content, bias towards any specific token can be avoided, which improves the accuracy of video action recognition.
(3) And adding position coding to the second characteristic data through second coding processing to obtain third characteristic data.
Considering that the attention structure in the transformer network loses position information, position coding is performed before the data is fed into the transformer network; the position coding is associated with the temporal position of each first frame in the target video sequence.
(4) And inputting the third characteristic data into a second CNN model to perform characteristic extraction, so as to obtain time sequence characteristics.
The attention mechanism of the transformer network structure is used to strengthen the feature expression of actions that change in the time dimension. Specifically, a multi-head attention mechanism may be introduced, and the dimension is then expanded and projected back through a multilayer perceptron block (Multilayer Perceptron Block, MLP Block) to ensure that the input and output dimensions remain consistent with the spatial features extracted by the SPEM. The MLP Block may be included in the transformer network structure or may be arranged independently outside the transformer network structure, which is not specifically limited in the embodiments of the present application.
Optionally, adding position coding to the second feature data by a second coding process includes:
the position code is calculated by the following formula:
pos_t(i) = sin(t / 10000^(2k/d)),  if i = 2k;
pos_t(i) = cos(t / 10000^(2k/d)),  if i = 2k+1;
wherein t is the actual temporal position of the feature code in the video sequence, pos_t is the position vector encoded for the t-th feature of the plurality of feature encodings, pos_t(i) is the value of the i-th element in the position vector, d is the dimension of the feature code, i = 2k indicates that the i-th element is an even-numbered element, and i = 2k+1 indicates that the i-th element is an odd-numbered element.
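The formula above is the standard sinusoidal position encoding. A minimal numpy sketch is given below; the function name and the assumption that d is even are illustrative, not taken from the patent.

```python
import numpy as np

def temporal_position_encoding(num_frames, d):
    """Sinusoidal temporal position encoding: sin on even elements (i = 2k),
    cos on odd elements (i = 2k + 1). Assumes d is even."""
    pe = np.zeros((num_frames, d))
    t = np.arange(num_frames)[:, None]           # actual temporal positions
    two_k = np.arange(0, d, 2)[None, :]          # even element indices i = 2k
    angle = t / np.power(10000.0, two_k / d)
    pe[:, 0::2] = np.sin(angle)                  # even-numbered elements
    pe[:, 1::2] = np.cos(angle)                  # odd-numbered elements
    return pe

# e.g. 16 key frames with 512-dimensional feature codes; the result is added
# to the token embeddings before they enter the transformer branch.
pe = temporal_position_encoding(num_frames=16, d=512)
```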
Optionally, a third CNN model with a resnet network structure is included in the SPEM.
Inputting the second frame into a spatial feature extraction module SPEM for feature extraction to obtain spatial features, wherein the method comprises the following steps:
and inputting the second frame into a third CNN model for feature extraction to obtain spatial features.
Considering that the overall appearance of the video changes slowly and smoothly, the second frame can be directly input into a CNN model for feature extraction. A CNN model with a resnet network structure is adopted: its residual structure can effectively alleviate network degradation and combines low-level high-resolution information with high-level strong-semantic information.
Optionally, fusing the temporal feature and the spatial feature to obtain a fused feature, including:
and performing channel splicing on the sequence features and the space features to obtain fusion features.
In this embodiment of the present application, a fusion module (CAEM) may be created to perform feature fusion rather than simple channel stacking. Specifically, the outputs of the SPEM and the TPEM may be converted into the same shape through a convolution layer and then spliced along the channel dimension. In order to better fuse the spatial and temporal features, an attention module may be added after the channel splicing to perform self-attention on the channel dimension, so that the information in the temporal and spatial dimensions is fused better.
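A minimal PyTorch sketch of this fusion step is given below. The module name, channel counts, and the particular form of channel self-attention (each channel's flattened spatial map acting as its own query/key/value) are assumptions for illustration, not the patented implementation.

```python
import torch
import torch.nn as nn

class FusionModule(nn.Module):
    """Project the SPEM and TPEM outputs to the same shape with 1x1 convolutions,
    concatenate along the channel axis, then apply self-attention over the
    channel dimension."""
    def __init__(self, spat_ch=256, temp_ch=512, fused_ch=256):
        super().__init__()
        self.spat_proj = nn.Conv2d(spat_ch, fused_ch, kernel_size=1)
        self.temp_proj = nn.Conv2d(temp_ch, fused_ch, kernel_size=1)

    def forward(self, spat_feat, temp_feat):          # both (B, C_i, H, W)
        fused = torch.cat([self.spat_proj(spat_feat),
                           self.temp_proj(temp_feat)], dim=1)   # (B, 2*fused_ch, H, W)
        b, c, h, w = fused.shape
        tokens = fused.view(b, c, h * w)               # one token per channel
        attn = torch.softmax(tokens @ tokens.transpose(1, 2) / (h * w) ** 0.5, dim=-1)
        refined = attn @ tokens                        # channels attend to each other
        return fused + refined.view(b, c, h, w)        # residual keeps the raw fusion
```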
Optionally, determining the video action according to the fusion feature includes:
and determining the video action according to the fusion characteristics and the preset corresponding relation.
The preset corresponding relation is the corresponding relation between the fusion characteristic and the video action.
In the embodiment of the application, the classification result may be output through a convolution layer and a linear mapping. The correspondence between specific fusion features and video actions may be preset, so that after the fusion features are obtained through the above process, the corresponding video action can be obtained directly.
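A minimal sketch of such an output head, assuming a 1x1 convolution, global pooling and a linear layer followed by a preset index-to-action table; the sizes and the label list are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ActionHead(nn.Module):
    """Map the fused feature map to action logits: 1x1 convolution, global
    average pooling, then a linear mapping to the action classes."""
    def __init__(self, fused_ch=512, num_actions=4):
        super().__init__()
        self.conv = nn.Conv2d(fused_ch, 256, kernel_size=1)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(256, num_actions)

    def forward(self, fused):                       # fused: (B, fused_ch, H, W)
        x = self.pool(self.conv(fused)).flatten(1)  # (B, 256)
        return self.fc(x)                           # (B, num_actions) logits

# Preset correspondence between class indices and video actions (example labels).
ACTIONS = ["sit down", "stand up", "wave", "bend over"]
# predicted = [ACTIONS[i] for i in logits.argmax(dim=-1).tolist()]
```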
The following describes an embodiment of the present application with reference to fig. 2. It should be noted that the specific parameters used below are examples and do not limit the parameters of the technical solution of the present application.
Referring to fig. 2, the dual-branch structure adopted by the video action recognition method provided by the embodiment of the present application is shown in the figure. The architecture takes both the spatial and the temporal feature expression of the video into account, and the specific flow of the scheme is as follows:
step one: data preparation.
For the action video, 32 consecutive frames are selected and key frames are extracted every other frame; the 16 extracted frames are preprocessed and input into the time sequence feature extraction module (TPEM), and the intermediate frame of the 16 frames is taken as the key frame and input into the spatial feature extraction module (SPEM).
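A minimal sketch of this sampling step, assuming the video has already been decoded into an array of frames; the function and variable names are illustrative.

```python
import numpy as np

def sample_frames(video_frames, clip_len=32, stride=2):
    """Take `clip_len` consecutive frames, keep every `stride`-th one as a key
    frame for the temporal branch (TPEM), and return the intermediate key frame
    for the spatial branch (SPEM). `video_frames` has shape (T, H, W, C)."""
    clip = video_frames[:clip_len]                    # 32 consecutive frames
    key_frames = clip[::stride]                       # every other frame -> 16 frames
    middle_frame = key_frames[len(key_frames) // 2]   # intermediate frame
    return key_frames, middle_frame

# Dummy usage: a 64-frame 224x224 RGB video.
video = np.zeros((64, 224, 224, 3), dtype=np.uint8)
keys, mid = sample_frames(video)
print(keys.shape, mid.shape)   # (16, 224, 224, 3) (224, 224, 3)
```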
Step two: Spatial feature extraction.
The overall appearance of the video changes slowly and smoothly, so the intermediate frame of the 16 frames is extracted as the spatial feature expression of the video; a resnet structure is used because the gradient may vanish as the network depth keeps increasing. Meanwhile, considering that the range of motion in the video varies and both large-amplitude and fine motions exist, the outputs of the resnet convolution layers of different scales, conv2, conv3, conv4 and conv5, are fused, and the fusion results of the different convolution layers are P2, P3, P4 and P5 respectively. The low-level high-resolution information and the high-level strong-semantic information are then combined from top to bottom, the last layer is taken as the output, and finally dimension reduction is performed through a convolution layer.
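A minimal PyTorch sketch of this top-down multi-scale fusion. The stage channel counts (those of a standard resnet-50), the 256-channel output, and taking the highest-resolution level P2 as the final reduced output are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleFusion(nn.Module):
    """Top-down fusion of the resnet stage outputs C2..C5 into P2..P5,
    in the spirit of a feature pyramid."""
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels)
        self.reduce = nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)

    def forward(self, c2, c3, c4, c5):
        p5 = self.lateral[3](c5)
        p4 = self.lateral[2](c4) + F.interpolate(p5, scale_factor=2, mode="nearest")
        p3 = self.lateral[1](c3) + F.interpolate(p4, scale_factor=2, mode="nearest")
        p2 = self.lateral[0](c2) + F.interpolate(p3, scale_factor=2, mode="nearest")
        return self.reduce(p2)   # last (highest-resolution) level, then reduce
```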
Step three: video timing feature extraction.
The key frame has now been processed, but if only the key frame is considered, the problem of action ambiguity remains. To eliminate this problem, the time sequence must be taken into account: in addition to the key frame, the past frames are also considered. Therefore, besides the key frame, a video segment containing the key frame needs to be input. To process this video segment, a time sequence feature extraction module TPEM is created, and the time sequence features are extracted from it mainly through the following steps.
1. Video frame information extraction.
The extracted 16 key frames are sent into a convolutional network for feature extraction; the network structure used for this extraction is the same as that used for the key frame.
2. Token embedding.
The feature map extracted in the previous step is divided into parts of fixed size, each part being 7×7, so each feature map generates 64 parts of the same size, i.e., the length of the token sequence is 64. A class token is then added, which is mainly used for classifying the video action. The class token is randomly initialized and, as the network is continuously updated during training, it gathers the information on all other tokens (global feature aggregation); because it is not based on image content, bias towards any specific token can be avoided, and since its position code is fixed, its output is not interfered with by the position coding. The token embedding is thus formed.
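A minimal PyTorch sketch of this tokenisation, assuming a 56×56 input feature map (so that 7×7 patches give 64 tokens) and a 512-dimensional embedding; the module name and sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TokenEmbedding(nn.Module):
    """Cut a (B, C, 56, 56) feature map into 64 non-overlapping 7x7 patches,
    project each patch into a token, and prepend a randomly initialised class
    token."""
    def __init__(self, channels=256, patch=7, embed_dim=512):
        super().__init__()
        # A strided convolution cuts and projects the patches in one step.
        self.proj = nn.Conv2d(channels, embed_dim, kernel_size=patch, stride=patch)
        self.cls_token = nn.Parameter(torch.randn(1, 1, embed_dim))  # random init

    def forward(self, feat):                        # feat: (B, C, 56, 56)
        tokens = self.proj(feat)                    # (B, D, 8, 8)
        tokens = tokens.flatten(2).transpose(1, 2)  # (B, 64, D)
        cls = self.cls_token.expand(tokens.size(0), -1, -1)
        return torch.cat([cls, tokens], dim=1)      # (B, 65, D)
```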
3. Temporal position coding.
The video frames are played in time order. Considering that each token corresponds to a different position in the video frame sequence, and that the attention structure in the transformer loses position information, temporal position coding is performed before the tokens are sent into the transformer network, and the position coding is added to the token embedding. The specific calculation formula of the position embedding is as follows:
pos_t(i) = sin(t / 10000^(2k/d)),  if i = 2k;
pos_t(i) = cos(t / 10000^(2k/d)),  if i = 2k+1;
wherein t is the actual temporal position of the feature code in the video sequence, pos_t is the position vector encoded for the t-th feature of the plurality of feature encodings, pos_t(i) is the value of the i-th element in the position vector, d is the dimension of the feature code, i = 2k indicates that the i-th element is an even-numbered element, and i = 2k+1 indicates that the i-th element is an odd-numbered element.
4. Time attention mechanism.
A temporal attention mechanism is introduced to strengthen the feature expression of actions that change in the time dimension. The result of the previous step is mapped into q, k and v as input, a multi-head attention mechanism is introduced, and the dimension is then expanded and projected back through an MLP Block to ensure that the input and output dimensions remain consistent. Optionally, 6 such blocks are stacked in total in this structure, and the final output serves as the feature expression of the temporal module TPEM.
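A minimal PyTorch sketch of one such attention + MLP block and the stack of six; the pre-norm layout, head count and MLP expansion ratio are common defaults assumed here for illustration.

```python
import torch
import torch.nn as nn

class TemporalEncoderBlock(nn.Module):
    """One block: multi-head self-attention followed by an MLP that expands the
    dimension and projects it back, so input and output dimensions match."""
    def __init__(self, dim=512, heads=8, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x):                  # x: (B, num_tokens, dim)
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]      # q, k, v are all mapped from the tokens
        x = x + self.mlp(self.norm2(x))
        return x

# Six blocks stacked, as suggested above.
tpem_encoder = nn.Sequential(*[TemporalEncoderBlock() for _ in range(6)])
```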
Step four: Feature fusion.
A CAEM module is created to perform feature fusion rather than simple channel stacking: the outputs of the SPEM module and the TPEM module are converted into the same shape through a convolution layer, and then channel splicing is performed.
Step five: Result output.
And finally, outputting a classification result through a convolution layer and linear mapping.
The video motion recognition method provided by the embodiment of the application may be executed by a video motion recognition device. In the embodiment of the present application, the case where a video motion recognition device executes the video motion recognition method is taken as an example to describe the video motion recognition device provided in the embodiment of the present application.
Referring to fig. 3, an embodiment of the present application provides a video motion recognition apparatus, including:
an extracting module 301, configured to extract a plurality of first frames from the target video sequence, and extract a second frame from the plurality of first frames;
a first feature extraction module 302, configured to input a plurality of first frames into the TPEM to perform feature extraction, so as to obtain a time sequence feature;
a second feature extraction module 303, configured to input a second frame into the SPEM for feature extraction, so as to obtain a spatial feature;
the fusion module 304 is configured to fuse the time sequence feature and the space feature to obtain a fusion feature;
a determining module 305, configured to determine a video action according to the fusion feature;
the TPEM comprises a neural network model with a resnet network structure and a neural network model with a transformer network structure, and the SPEM comprises the neural network model with the resnet network structure.
Optionally, the TPEM includes a first convolutional neural network CNN model having a resnet network structure and a second CNN model having a transformer network structure;
the first feature extraction module is specifically configured to:
inputting a plurality of first frames into a first CNN model for feature extraction to obtain first feature data with a plurality of feature codes;
adding category codes to the first characteristic data through first coding processing to obtain second characteristic data;
adding position coding to the second characteristic data through second coding processing to obtain third characteristic data;
inputting the third characteristic data into a second CNN model for characteristic extraction to obtain time sequence characteristics;
wherein the category codes are associated with categories of video actions and the category codes are randomly initialized and the position codes are associated with a temporal position of each first frame in the target video sequence.
Optionally, the first feature extraction module is specifically configured to:
the position code is calculated by the following formula:
pos_t(i) = sin(t / 10000^(2k/d)),  if i = 2k;
pos_t(i) = cos(t / 10000^(2k/d)),  if i = 2k+1;
wherein t is the actual temporal position of the feature code in the video sequence, pos_t is the position vector encoded for the t-th feature of the plurality of feature encodings, pos_t(i) is the value of the i-th element in the position vector, d is the dimension of the feature code, i = 2k indicates that the i-th element is an even-numbered element, and i = 2k+1 indicates that the i-th element is an odd-numbered element.
Optionally, the SPEM includes a third CNN model having a resnet network structure;
the second feature extraction module is specifically configured to:
and inputting the second frame into a third CNN model for feature extraction to obtain spatial features.
Optionally, the fusion module is specifically configured to:
and performing channel splicing on the sequence features and the space features to obtain fusion features.
Optionally, the determining module is specifically configured to:
determining video actions according to the fusion characteristics and a preset corresponding relation;
the preset corresponding relation is the corresponding relation between the fusion characteristic and the video action.
The video motion recognition device in the embodiment of the application may be an electronic device, for example, an electronic device with an operating system, or may be a component in the electronic device, for example, an integrated circuit or a chip. The electronic device may be a terminal, or may be other devices than a terminal. By way of example, the other device may be a server, network attached storage (Network Attached Storage, NAS), etc., and embodiments of the present application are not specifically limited.
The video action recognition device provided by the embodiment of the application can realize each process realized by the embodiment of the method and achieve the same technical effect, and in order to avoid repetition, the description is omitted here.
Referring to fig. 4, an embodiment of the present invention provides an electronic device 400, including: at least one processor 401, a memory 402, a user interface 403 and at least one network interface 404. The various components in electronic device 400 are coupled together by bus system 405.
It is understood that the bus system 405 is used to enable connected communications between these components. The bus system 405 includes a power bus, a control bus, and a status signal bus in addition to a data bus. But for clarity of illustration the various buses are labeled as bus system 405 in fig. 4.
The user interface 403 may include, among other things, a display, a keyboard, or a pointing device (e.g., a mouse, a trackball, a touch pad, or a touch screen, etc.).
It will be appreciated that the memory 402 in embodiments of the invention may be either volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The nonvolatile memory may be a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), or a flash memory. The volatile memory may be a Random Access Memory (RAM), which acts as an external cache. By way of example, and not limitation, many forms of RAM are available, such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), and Direct Rambus RAM (DRRAM). The memory 402 described in embodiments of the present invention is intended to comprise, without being limited to, these and any other suitable types of memory.
In some implementations, the memory 402 stores the following elements, executable modules or data structures, or a subset thereof, or an extended set thereof: an operating system 4021 and application programs 4022.
The operating system 4021 includes various system programs, such as a framework layer, a core library layer, a driver layer, and the like, for implementing various basic services and processing hardware-based tasks. The application programs 4022 include various application programs such as a media player, a browser, and the like for implementing various application services. A program for implementing the method of the embodiment of the present invention may be included in the application program 4022.
In an embodiment of the present invention, the electronic device 400 may further include: a program stored on the memory 402 and executable on the processor 401, which when executed by the processor 401, implements the steps of the method provided by the embodiment of the invention.
The method disclosed in the above embodiment of the present invention may be applied to the processor 401 or implemented by the processor 401. The processor 401 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in the processor 401 or by instructions in the form of software. The processor 401 described above may be a general purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field programmable gate array (Field Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components, and can implement or perform the methods, steps and logic blocks disclosed in the embodiments of the present invention. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present invention may be embodied as being executed directly by a hardware decoding processor, or by a combination of hardware and software modules in a decoding processor. The software modules may be located in a computer readable storage medium well known in the art, such as a random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, registers, and the like. The computer readable storage medium is located in the memory 402, and the processor 401 reads the information in the memory 402 and performs the steps of the above method in combination with its hardware. In particular, the computer readable storage medium has a computer program stored thereon.
It is to be understood that the embodiments of the invention described herein may be implemented in hardware, software, firmware, middleware, microcode, or a combination thereof. For a hardware implementation, the processing units may be implemented within one or more ASICs, DSPs, digital signal processing devices (DSPDs), programmable logic devices (Programmable Logic Device, PLD), FPGAs, general purpose processors, controllers, microcontrollers, microprocessors, other electronic units used to perform the functions described herein, or a combination thereof.
The embodiment of the present application further provides a readable storage medium, where a program or an instruction is stored on the readable storage medium, and when the program or the instruction is executed by a processor, the program or the instruction realizes each process of the embodiment of the video motion recognition method, and the same technical effect can be achieved, so that repetition is avoided, and no further description is provided herein.
Wherein the processor is a processor in the terminal described in the above embodiment. The readable storage medium includes computer readable storage medium such as computer readable memory ROM, random access memory RAM, magnetic or optical disk, etc. In some examples, the readable storage medium may be a non-transitory readable storage medium.
The embodiment of the application further provides a chip, the chip includes a processor and a communication interface, the communication interface is coupled with the processor, the processor is used for running a program or an instruction, implementing each process of the video motion recognition method embodiment, and achieving the same technical effect, so as to avoid repetition, and no redundant description is provided herein.
It should be understood that the chips referred to in the embodiments of the present application may also be referred to as system-on-chip chips, or the like.
The embodiments of the present application further provide a computer program/program product, where the computer program/program product is stored in a storage medium, and the computer program/program product is executed by at least one processor to implement each process of the embodiments of the video motion recognition method, and the same technical effects can be achieved, so that repetition is avoided, and details are not repeated herein.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element. Furthermore, it should be noted that the scope of the methods and apparatus in the embodiments of the present application is not limited to performing the functions in the order shown or discussed, but may also include performing the functions in a substantially simultaneous manner or in an opposite order depending on the functions involved, e.g., the described methods may be performed in an order different from that described, and various steps may also be added, omitted, or combined. Additionally, features described with reference to certain examples may be combined in other examples.
From the description of the embodiments above, it will be apparent to those skilled in the art that the above-described example methods may be implemented by means of a computer software product plus a necessary general purpose hardware platform, but may also be implemented by hardware. The computer software product is stored on a storage medium (such as ROM, RAM, magnetic disk, optical disk, etc.) and includes instructions for causing a terminal or network side device to perform the methods described in the various embodiments of the present application.
The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the above-described embodiments, which are merely illustrative and not restrictive, and many forms of embodiments may be made by those of ordinary skill in the art without departing from the spirit of the present application and the scope of the protection of the claims, which fall within the protection of the present application.
Claims (12)
1. A method for identifying video actions, comprising:
extracting a plurality of first frames from the target video sequence, and extracting a second frame from the plurality of first frames;
inputting the plurality of first frames into a time sequence feature extraction module TPEM for feature extraction to obtain time sequence features;
inputting the second frame into a spatial feature extraction module SPEM for feature extraction to obtain spatial features;
fusing the time sequence features and the space features to obtain fusion features;
determining a video action according to the fusion characteristics;
the TPEM comprises a neural network model with a residual (resnet) network structure and a neural network model with a transformer network structure, and the SPEM comprises a neural network model with the resnet network structure;
the TPEM comprises a first convolutional neural network CNN model with a resnet network structure and a second CNN model with a transformer network structure;
the step of inputting the plurality of first frames into the TPEM to perform feature extraction to obtain time sequence features includes:
inputting the plurality of first frames into the first CNN model for feature extraction to obtain first feature data with a plurality of feature codes;
adding category codes to the first characteristic data through first coding processing to obtain second characteristic data;
adding position coding to the second characteristic data through second coding processing to obtain third characteristic data;
inputting the third characteristic data into the second CNN model for characteristic extraction to obtain the time sequence characteristic;
wherein the category code is associated with a category of the video action and the category code is randomly initialized, the position code being associated with a temporal position of each of the first frames in the target video sequence.
2. The method according to claim 1, wherein the adding position coding to the second feature data by the second coding process includes:
the position code is calculated by the following formula:
pos_t(i) = sin(t / 10000^(2k/d)),  if i = 2k;
pos_t(i) = cos(t / 10000^(2k/d)),  if i = 2k+1;
wherein t is the actual temporal position of the feature code in the video sequence, pos_t is the position vector encoded for the t-th feature of the plurality of feature encodings, pos_t(i) is the value of the i-th element in the position vector, d is the dimension of the feature code, i = 2k indicates that the i-th element is an even-numbered element, and i = 2k+1 indicates that the i-th element is an odd-numbered element.
3. The method of claim 1, wherein the SPEM includes a third CNN model having a resnet network structure;
inputting the second frame into a spatial feature extraction module SPEM for feature extraction to obtain spatial features, wherein the method comprises the following steps:
and inputting the second frame into the third CNN model to perform feature extraction to obtain spatial features.
4. The method of claim 1, wherein the fusing the temporal feature and the spatial feature to obtain a fused feature comprises:
and performing channel splicing on the time sequence features and the space features to obtain the fusion features.
5. The method of claim 1, wherein said determining a video action from said fusion feature comprises:
determining the video action according to the fusion characteristics and a preset corresponding relation;
the preset corresponding relation is the corresponding relation between the fusion characteristic and the video action.
6. A video motion recognition apparatus, comprising:
an extraction module for extracting a plurality of first frames from the target video sequence and extracting a second frame from the plurality of first frames;
the first feature extraction module is used for inputting the plurality of first frames into the TPEM to perform feature extraction to obtain time sequence features;
the second feature extraction module is used for inputting the second frame into the SPEM to perform feature extraction to obtain spatial features;
the fusion module is used for fusing the time sequence features and the space features to obtain fusion features;
the determining module is used for determining video actions according to the fusion characteristics;
the TPEM comprises a neural network model with a resnet network structure and a neural network model with a transformer network structure, and the SPEM comprises the neural network model with the resnet network structure;
the TPEM comprises a first convolutional neural network CNN model with a resnet network structure and a second CNN model with a transformer network structure;
the first feature extraction module is specifically configured to:
inputting the plurality of first frames into the first CNN model for feature extraction to obtain first feature data with a plurality of feature codes;
adding category codes to the first characteristic data through first coding processing to obtain second characteristic data;
adding position coding to the second characteristic data through second coding processing to obtain third characteristic data;
inputting the third characteristic data into the second CNN model for characteristic extraction to obtain the time sequence characteristic;
wherein the category code is associated with a category of the video action and the category code is randomly initialized, the position code being associated with a temporal position of each of the first frames in the target video sequence.
7. The apparatus of claim 6, wherein the first feature extraction module is specifically configured to:
the position code is calculated by the following formula:
pos_t(i) = sin(t / 10000^(2k/d)),  if i = 2k;
pos_t(i) = cos(t / 10000^(2k/d)),  if i = 2k+1;
wherein t is the actual temporal position of the feature code in the video sequence, pos_t is the position vector encoded for the t-th feature of the plurality of feature encodings, pos_t(i) is the value of the i-th element in the position vector, d is the dimension of the feature code, i = 2k indicates that the i-th element is an even-numbered element, and i = 2k+1 indicates that the i-th element is an odd-numbered element.
8. The apparatus of claim 6, wherein the SPEM includes a third CNN model having a resnet network structure;
the second feature extraction module is specifically configured to:
and inputting the second frame into the third CNN model to perform feature extraction to obtain spatial features.
9. The device according to claim 6, wherein the fusion module is specifically configured to:
and performing channel splicing on the time sequence features and the space features to obtain the fusion features.
10. The apparatus according to claim 6, wherein the determining module is specifically configured to:
determining the video action according to the fusion characteristics and a preset corresponding relation;
the preset corresponding relation is the corresponding relation between the fusion characteristic and the video action.
11. An electronic device comprising a processor and a memory storing a program or instructions executable on the processor, which when executed by the processor, implement the steps of the video action recognition method of any one of claims 1 to 5.
12. A readable storage medium, wherein a program or instructions is stored on the readable storage medium, which when executed by a processor, implements the steps of the video action recognition method according to any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311162287.3A CN116895038B (en) | 2023-09-11 | 2023-09-11 | Video motion recognition method and device, electronic equipment and readable storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311162287.3A CN116895038B (en) | 2023-09-11 | 2023-09-11 | Video motion recognition method and device, electronic equipment and readable storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116895038A (en) | 2023-10-17
CN116895038B (en) | 2024-01-26
Family
ID=88311127
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311162287.3A Active CN116895038B (en) | 2023-09-11 | 2023-09-11 | Video motion recognition method and device, electronic equipment and readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116895038B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113673489A (en) * | 2021-10-21 | 2021-11-19 | 之江实验室 | Video group behavior identification method based on cascade Transformer |
CN115019239A (en) * | 2022-07-04 | 2022-09-06 | 福州大学 | Real-time action positioning method based on space-time cross attention |
CN116453025A (en) * | 2023-05-11 | 2023-07-18 | 南京邮电大学 | Volleyball match group behavior identification method integrating space-time information in frame-missing environment |
CN116580453A (en) * | 2023-04-26 | 2023-08-11 | 哈尔滨工程大学 | Human body behavior recognition method based on space and time sequence double-channel fusion model |
CN116703980A (en) * | 2023-08-04 | 2023-09-05 | 南昌工程学院 | Target tracking method and system based on pyramid pooling transducer backbone network |
-
2023
- 2023-09-11 CN CN202311162287.3A patent/CN116895038B/en active Active
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113673489A (en) * | 2021-10-21 | 2021-11-19 | 之江实验室 | Video group behavior identification method based on cascade Transformer |
CN115019239A (en) * | 2022-07-04 | 2022-09-06 | 福州大学 | Real-time action positioning method based on space-time cross attention |
CN116580453A (en) * | 2023-04-26 | 2023-08-11 | 哈尔滨工程大学 | Human body behavior recognition method based on space and time sequence double-channel fusion model |
CN116453025A (en) * | 2023-05-11 | 2023-07-18 | 南京邮电大学 | Volleyball match group behavior identification method integrating space-time information in frame-missing environment |
CN116703980A (en) * | 2023-08-04 | 2023-09-05 | 南昌工程学院 | Target tracking method and system based on pyramid pooling transducer backbone network |
Non-Patent Citations (2)
Title |
---|
Two-Stream Transformer Architecture for Long Form Video Understanding; Edward Fish et al.; arXiv; pp. 1-14 *
ViViT: A Video Vision Transformer; A. Arnab et al.; IEEE/CVF International Conference on Computer Vision; pp. 6816-6826 *
Also Published As
Publication number | Publication date |
---|---|
CN116895038A (en) | 2023-10-17 |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| GR01 | Patent grant |