CN117409479A - Multi-label action recognition method based on simultaneous and sequential action relation modeling - Google Patents
- Publication number: CN117409479A
- Application number: CN202311407105.4A
- Authority: CN
- Language: China
- Prior art keywords: action, feature, simultaneous, relation, graphs
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06V40/20: Recognition of human movements or behaviour, e.g. gesture recognition
- G06N3/0455: Auto-encoder networks; encoder-decoder networks
- G06N3/0499: Feedforward networks
- G06N3/08: Learning methods for neural networks
- G06V10/25: Determination of region of interest (ROI) or volume of interest (VOI)
- G06V10/764: Recognition using classification, e.g. of video objects
- G06V10/7715: Feature extraction, e.g. by transforming the feature space
- G06V10/774: Generating sets of training patterns; bootstrap methods
- G06V10/806: Fusion of extracted features
- G06V10/82: Recognition using neural networks
- G06V20/41: Higher-level, semantic clustering, classification or understanding of video scenes
- G06V20/46: Extracting features or characteristics from the video content
Abstract
The invention provides a multi-label action recognition method based on simultaneous and sequential action relation modeling. The method comprises: acquiring a video to be recognized and performing feature extraction with a feature extraction network to obtain first feature maps for consecutive frames; in a simultaneous-action relation modeling module, inputting the first feature maps, modeling the relations among simultaneous actions based on the spatial information of each first feature map, and outputting a plurality of second feature maps carrying the simultaneous-action relations; and, in a sequential-action relation modeling module, inputting the plurality of second feature maps, and classifying and predicting the final action result by fusing the second feature maps with a plurality of third feature maps carrying sequential-action relations obtained in an intermediate step. The method solves the problem in the prior art that only action co-occurrence is considered while the temporal order of actions is ignored.
Description
Technical Field
The invention relates to the technical field of image processing, in particular to a multi-label action recognition method based on simultaneous and sequential action relation modeling.
Background
With the rapid development of computer vision, video action recognition based on deep convolutional neural networks has attracted increasing attention from researchers, and action recognition algorithms are gradually being applied in fields such as video surveillance. Although great breakthroughs have been achieved in the accuracy of single-action recognition, in real-world scenes the captured video often contains multiple actions.
In recent years, researchers have gradually shifted their focus to multi-label action videos, with the goal of identifying all actions in a video. However, since different actions in a video may be performed by different objects and may occur at different times, multi-label action recognition poses a significant challenge.
Currently, the mainstream approach to the multi-label action recognition task is to split the multi-label problem into several single-label tasks for classification. Although these methods can bring some performance improvement to multi-label action recognition thanks to improvements in the feature extractor, they do not consider the relations between actions, which limits performance. To address this, researchers have focused on establishing relations between actions to reduce the search space and thereby improve recognition performance; the main method borrows from multi-label image recognition, building a graph structure from action co-occurrence and then capturing action relations with that graph. Such methods, while intuitive, ignore that actions have a temporal order that cannot be arbitrarily permuted, so they cannot construct accurate action relations.
Therefore, a new multi-label action recognition method is needed to solve the prior-art problem of considering only action co-occurrence while ignoring temporal order.
Disclosure of Invention
The technical problem to be solved by the embodiments of the invention is to provide a multi-label action recognition method based on simultaneous and sequential action relation modeling, which solves the problem in the prior art that only action co-occurrence is considered while the temporal order of actions is ignored.
To solve this technical problem, an embodiment of the invention provides a multi-label action recognition method based on simultaneous and sequential action relation modeling, comprising the following steps:
acquiring a video to be identified, and performing feature extraction based on a preset feature extraction network to obtain first feature maps for consecutive frames;
in a preset simultaneous-action relation modeling module, inputting the obtained first feature maps of the consecutive frames, modeling the relations of simultaneous actions based on the spatial information of each first feature map, and outputting a plurality of second feature maps with simultaneous-action relations; and
in a preset sequential-action relation modeling module, inputting the plurality of second feature maps with simultaneous-action relations, and classifying and predicting a final action result by fusing the plurality of second feature maps with a plurality of third feature maps with sequential-action relations obtained in an intermediate step.
The feature extraction network is an X3D-series feature extractor, and the first feature maps it outputs are expressed as X = f_backbone(V) ∈ R^(T×D×H×W); where f_backbone(·) denotes the feature extractor; X denotes the first feature maps; H, W and D denote the height, width and number of channels of each frame's first feature map; V = {v_1, v_2, ..., v_T} denotes the video to be identified; T denotes the total number of frames of the video; Y = {y_1, ..., y_C} denotes the action category labels of the video to be identified, C being the number of action classes in the dataset, with y_i = 1 indicating that action i is present in the video.
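As a rough sketch of this extraction step's input/output contract, the stand-in below reproduces only the shape mapping of f_backbone; the real X3D network is not reproduced, and the tensor sizes (T=8 frames, d=64 channels, 7×7 spatial grid) are illustrative assumptions:

```python
import numpy as np

def backbone_stub(video, d=64, h=7, w=7):
    """Stand-in for the X3D feature extractor f_backbone: maps a video
    of T frames to per-frame first feature maps of shape (T, d, h, w).
    A real run would load pretrained X3D weights instead."""
    t = video.shape[0]
    rng = np.random.default_rng(0)
    return rng.standard_normal((t, d, h, w))

video = np.zeros((8, 3, 224, 224))  # T=8 RGB frames (toy input)
X = backbone_stub(video)
assert X.shape == (8, 64, 7, 7)     # one d x h x w feature map per frame
```

Every later module in the method operates on tensors of this (T, d, h, w) shape.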
The simultaneous-action relation modeling module is a Transformer encoder, consisting of a self-attention network and a feed-forward network; wherein,
the simultaneous-action relation modeling module obtains the plurality of second feature maps with simultaneous-action relations by performing the following steps:
adjusting the spatial layout of the first feature maps of the consecutive frames by the formula X_fsim = f_reshape(X, T×(h·w)×d); where f_reshape(·,·) denotes a resizing function that takes two inputs: the feature map to be resized and the target dimensions;
after the spatial adjustment of each frame's first feature map, capturing the spatial relations of simultaneous actions with the self-attention network of the Transformer encoder, and then mapping the dimensions back to the input size with the feed-forward network of the Transformer encoder, so as to obtain a plurality of second feature maps in which the simultaneous-action relations have been built;
readjusting the sizes of the plurality of second feature maps by X_simu = f_reshape(f_simu(X_fsim), T×d×h×w) and outputting them; where f_simu(·) denotes a standard Transformer encoder.
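The reshape-encode-reshape round trip of this module can be sketched as follows. The token layout (one token per spatial position, feature dimension d) and the identity stand-in for the Transformer encoder f_simu are assumptions for illustration, not the patented network:

```python
import numpy as np

def f_reshape(x, shape):
    """The resizing function from the method: reshape to target dims."""
    return x.reshape(shape)

def identity_encoder(tokens):
    """Placeholder for the standard Transformer encoder f_simu
    (self-attention + feed-forward); here simply the identity."""
    return tokens

def simultaneous_module(X):
    """Sketch of the simultaneous-action module: flatten each frame's
    h x w spatial grid into h*w tokens of dimension d, run the
    (stand-in) encoder over them, then reshape back to (T, d, h, w)."""
    T, d, h, w = X.shape
    X_fsim = f_reshape(X.transpose(0, 2, 3, 1), (T, h * w, d))
    encoded = identity_encoder(X_fsim)
    return f_reshape(encoded, (T, h, w, d)).transpose(0, 3, 1, 2)

X = np.arange(2 * 4 * 3 * 3, dtype=float).reshape(2, 4, 3, 3)
X_simu = simultaneous_module(X)
assert X_simu.shape == X.shape
assert np.allclose(X_simu, X)  # identity encoder: exact round trip
```

With a real encoder in place of the identity, each spatial token would aggregate information from every other position in the frame, which is what lets co-occurring actions reinforce each other.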
The sequence action relation modeling module utilizes a causal self-attention mechanism to construct an action sequence relation; wherein,
the sequence action relation modeling module predicts a final action result by executing the following steps:
adding a position embedding vector to the resized second feature maps by the formula X'_simu = X_simu + p, so as to retain the position information of each frame; where p denotes a learnable position embedding vector;
resizing the second feature maps with the added position embeddings along the time dimension by the formula X_fseq = f_reshape(X'_simu, (T/2)×2×d×h×w); where the two-by-two grouping along the time dimension lets the sequence relation of the two frames' actions within each group be captured;
constructing a mask matrix M whose entries above the main diagonal are −∞ and whose remaining entries are 0, and realizing a causal self-attention mechanism by the formula f_attn(X) = f_softmax((f_Q(X)·f_K(X)^T)/√(d_k) + M)·f_V(X), so that within each group information can only flow from the second feature map of the previous frame to the second feature map of the next frame, obtaining several groups of second feature maps; where f_Q(·), f_K(·), f_V(·) denote three linear mapping functions, d_k denotes a scaling factor, and f_softmax(·) denotes the softmax function;
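A minimal numeric illustration of the causal self-attention step, with the mask matrix built as −∞ above the diagonal; the single-head form and the weight shapes are simplifying assumptions:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax along the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def causal_self_attention(X, Wq, Wk, Wv):
    """Causal self-attention over a sequence of n frame features X (n, d).
    The mask M puts -inf above the diagonal, so position i attends only
    to positions <= i: information flows from front to back only."""
    n = X.shape[0]
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]                            # scaling factor
    M = np.triu(np.full((n, n), -np.inf), k=1)   # mask matrix
    A = softmax(Q @ K.T / np.sqrt(d_k) + M)      # attention weights
    return A @ V, A

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))
W = [rng.standard_normal((8, 8)) for _ in range(3)]
out, A = causal_self_attention(X, *W)
assert out.shape == (4, 8)
assert np.allclose(np.triu(A, k=1), 0.0)  # no attention to future frames
```

The zeroed upper triangle of A is the numerical realization of the constraint that each group's information can only flow from the earlier frame to the later one.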
aggregating each group of second feature maps by the formula X̂ = X^(1) + α·X^(2), where X^(1) and X^(2) are the second feature maps of the earlier and later frame in a group, and repeating the causal self-attention and feature aggregation until the time dimension is reduced to 1, so as to obtain the plurality of third feature maps X'_seq with sequential-action relations; where α denotes a hyperparameter and X̂ denotes a second feature map after aggregation;
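The pair-wise grouping that halves the time dimension until it reaches 1 can be sketched as below. The weighted-sum form X1 + alpha·X2 of the aggregation is an assumption (the exact formula was lost in extraction), and the causal self-attention applied inside each group is elided:

```python
import numpy as np

def aggregate_pairs(X, alpha=0.5):
    """Collapse consecutive frame pairs (X1, X2) along time into one
    feature map each, via the assumed weighted sum X1 + alpha * X2."""
    return X[0::2] + alpha * X[1::2]

def reduce_to_one(X, alpha=0.5):
    """Repeat grouping + aggregation until the time dimension is 1.
    (A causal self-attention step over each pair would precede the
    aggregation in the full method.) Assumes T is a power of two."""
    while X.shape[0] > 1:
        X = aggregate_pairs(X, alpha)
    return X[0]

X = np.ones((8, 4, 2, 2))          # T=8 toy second feature maps
out = reduce_to_one(X, alpha=1.0)  # 8 -> 4 -> 2 -> 1 frames
assert out.shape == (4, 2, 2)
assert np.allclose(out, 8.0)       # all-ones summed pairwise three times
```

Each halving step widens the temporal span covered by one feature map, so the final map summarizes the whole ordered sequence.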
performing feature fusion between the plurality of second feature maps with simultaneous-action relations and the plurality of third feature maps with sequential-action relations through a residual connection, and resizing by the formula X_seq = f_reshape(X'_seq + f_mean(X_fseq), d×h×w); where f_mean(·) denotes the mean function;
classifying the fused feature map and predicting the final action result by the formula ŷ = f_fc(f_gap(X_seq)); where f_fc(·) denotes a fully connected layer, f_gap(·) denotes global average pooling, and ŷ denotes the prediction confidence.
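The classification head, global average pooling f_gap followed by the fully connected layer f_fc, can be illustrated with plain arrays; the dimensions d=16, h=w=7 and C=5 action classes are arbitrary toy values:

```python
import numpy as np

def f_gap(X):
    """Global average pooling over the spatial h, w axes of (d, h, w)."""
    return X.mean(axis=(1, 2))

def f_fc(z, W, b):
    """Fully connected layer mapping the pooled d-vector to C logits."""
    return z @ W + b

d, h, w, C = 16, 7, 7, 5
rng = np.random.default_rng(0)
X_seq = rng.standard_normal((d, h, w))  # fused feature map
W, b = rng.standard_normal((d, C)), np.zeros(C)
y_hat = f_fc(f_gap(X_seq), W, b)        # one confidence score per action
assert y_hat.shape == (C,)
```

The C logits are turned into per-action probabilities by the sigmoid in the loss below, one independent decision per label rather than a single softmax over classes.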
The loss function for the prediction confidence ŷ is the multi-label binary cross-entropy
L = −∑_{i=1}^{C} [ y_i·log σ(ŷ_i) + (1−y_i)·log(1−σ(ŷ_i)) ],
where σ(·) denotes the sigmoid function.
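Under the stated form of the loss, a small sanity check that confident correct predictions give near-zero loss while confident wrong ones do not; averaging over classes rather than summing is an assumption here:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def multilabel_bce(y_hat, y):
    """Multi-label binary cross-entropy over C action classes:
    each class gets an independent sigmoid + log-loss term."""
    p = sigmoid(y_hat)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1.0, 0.0, 1.0])                       # actions 0 and 2 present
perfect = multilabel_bce(np.array([10.0, -10.0, 10.0]), y)
bad = multilabel_bce(np.array([-10.0, 10.0, -10.0]), y)
assert perfect < 1e-3 < bad   # near-zero loss only when labels match
```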
The embodiment of the invention has the following beneficial effects:
The invention not only provides a simultaneous-action relation modeling module that effectively constructs the simultaneous-action relations, but also provides a sequential-action relation modeling module that remedies the inability of the original self-attention to capture order information: by constraining information to flow only from front to back, the sequential-action relations are effectively constructed. The method thus solves the prior-art problem of considering only action co-occurrence while ignoring temporal order.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions of the prior art, the drawings which are required in the description of the embodiments or the prior art will be briefly described, it being obvious that the drawings in the description below are only some embodiments of the invention, and that it is within the scope of the invention to one skilled in the art to obtain other drawings from these drawings without inventive faculty.
FIG. 1 is a flowchart of a multi-tag action recognition method based on simultaneous and sequential action relationship modeling provided by an embodiment of the present invention;
FIG. 2 is a logic diagram of a simultaneous action relation modeling module in a multi-label action recognition method based on simultaneous and sequential action relation modeling according to an embodiment of the present invention;
FIG. 3 is a visual result diagram of capturing a simultaneous relationship using the action simultaneous relationship modeling module of FIG. 2;
FIG. 4 is a logic diagram of a sequential action relation modeling module in a multi-label action recognition method based on simultaneous and sequential action relation modeling according to an embodiment of the present invention;
FIG. 5 is a graph of a visual result of capturing a sequence relationship using the sequence action relationship modeling module of FIG. 4;
fig. 6 is an overall logic diagram of a multi-tag action recognition method based on simultaneous and sequential action relationship modeling according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings, for the purpose of making the objects, technical solutions and advantages of the present invention more apparent.
As shown in fig. 1, in an embodiment of the present invention, a multi-tag action recognition method based on simultaneous and sequential action relationship modeling is provided, where the method includes the following steps:
Step S1, acquiring a video to be identified, and performing feature extraction based on a preset feature extraction network to obtain first feature maps for consecutive frames.
Firstly, the video to be identified is acquired.
Secondly, feature extraction is performed on the video to be identified by the predefined feature extraction network to obtain the first feature maps for the consecutive frames. The feature extraction network is an X3D-series feature extractor, whose output first feature maps are expressed as X = f_backbone(V) ∈ R^(T×D×H×W); where f_backbone(·) denotes the feature extractor; X denotes the first feature maps; H, W and D denote the height, width and number of channels of each frame's first feature map; V = {v_1, v_2, ..., v_T} denotes the video to be identified; T denotes the total number of frames of the video; Y = {y_1, ..., y_C} denotes the action category labels of the video, C being the number of action classes in the dataset, with y_i = 1 indicating that action i is present in the video.
Step S2, in the preset simultaneous-action relation modeling module, inputting the obtained first feature maps of the consecutive frames, modeling the relations of simultaneous actions based on the spatial information of each first feature map, and outputting a plurality of second feature maps with simultaneous-action relations.
Firstly, the simultaneous-action relation modeling module is predefined; the module is a Transformer encoder consisting of a self-attention network and a feed-forward network, as shown in fig. 2.
Secondly, the first feature maps of the consecutive frames are fed into the simultaneous-action relation modeling module, the relations of simultaneous actions are modeled based on the spatial information of each first feature map, and a plurality of second feature maps with simultaneous-action relations are output, specifically as follows:
adjusting the spatial layout of the first feature maps of the consecutive frames by the formula X_fsim = f_reshape(X, T×(h·w)×d); where f_reshape(·,·) denotes a resizing function that takes two inputs: the feature map to be resized and the target dimensions;
after the spatial adjustment of each frame's first feature map, capturing the spatial relations of simultaneous actions with the self-attention network of the Transformer encoder, and then mapping the dimensions back to the input size with the feed-forward network of the Transformer encoder, so as to obtain a plurality of second feature maps in which the simultaneous-action relations have been built;
readjusting the sizes of the plurality of second feature maps by X_simu = f_reshape(f_simu(X_fsim), T×d×h×w) and outputting them; where f_simu(·) denotes a standard Transformer encoder.
In one example, fig. 3 shows the visualization of the captured simultaneous relations: co-occurring actions that were not activated before become activated once the simultaneous relations are captured.
Step S3, inputting the obtained second feature maps with simultaneous-action relations into the preset sequential-action relation modeling module, and classifying and predicting the final action result by fusing the input second feature maps with the third feature maps with sequential-action relations obtained in the intermediate step.
Specifically, first, the predefined sequential-action relation modeling module constructs the action sequence relations with a causal self-attention mechanism, capturing the order of actions at the temporal level, as shown in fig. 4.
Secondly, the second feature maps with simultaneous-action relations are fed into the sequential-action relation modeling module for feature extraction, fusion and classification, and the multi-label action classes are predicted, specifically as follows:
adding a position embedding vector to the resized second feature maps by the formula X'_simu = X_simu + p, so as to retain the position information of each frame; where p denotes a learnable position embedding vector;
resizing the second feature maps with the added position embeddings along the time dimension by the formula X_fseq = f_reshape(X'_simu, (T/2)×2×d×h×w); where the two-by-two grouping along the time dimension lets the sequence relation of the two frames' actions within each group be captured;
constructing a mask matrix M whose entries above the main diagonal are −∞ and whose remaining entries are 0, and realizing a causal self-attention mechanism by the formula f_attn(X) = f_softmax((f_Q(X)·f_K(X)^T)/√(d_k) + M)·f_V(X), so that within each group information can only flow from the second feature map of the previous frame to the second feature map of the next frame, obtaining several groups of second feature maps; where f_Q(·), f_K(·), f_V(·) denote three linear mapping functions, d_k denotes a scaling factor, and f_softmax(·) denotes the softmax function;
aggregating each group of second feature maps by the formula X̂ = X^(1) + α·X^(2), where X^(1) and X^(2) are the second feature maps of the earlier and later frame in a group, and repeating the causal self-attention and feature aggregation until the time dimension is reduced to 1, so as to obtain the plurality of third feature maps X'_seq with sequential-action relations; where α denotes a hyperparameter and X̂ denotes a second feature map after aggregation;
performing feature fusion between the second feature maps with simultaneous-action relations and the third feature maps with sequential-action relations through a residual connection, and resizing by the formula X_seq = f_reshape(X'_seq + f_mean(X_fseq), d×h×w); where f_mean(·) denotes the mean function;
classifying the fused feature map and predicting the final action result by the formula ŷ = f_fc(f_gap(X_seq)); where f_fc(·) denotes a fully connected layer, f_gap(·) denotes global average pooling, and ŷ denotes the prediction confidence.
It should be noted that the loss function for the prediction confidence ŷ is the multi-label binary cross-entropy
L = −∑_{i=1}^{C} [ y_i·log σ(ŷ_i) + (1−y_i)·log(1−σ(ŷ_i)) ],
where σ(·) denotes the sigmoid function.
In one example, fig. 5 shows the visualization of the captured sequence relations: actions that occur one after another and were not activated before become activated after the sequence relations are captured.
At this time, an overall logic diagram of a multi-tag action recognition method based on simultaneous and sequential action relationship modeling in the embodiment of the present invention is shown in fig. 6.
The embodiment of the invention has the following beneficial effects:
The invention not only provides a simultaneous-action relation modeling module that effectively constructs the simultaneous-action relations, but also provides a sequential-action relation modeling module that remedies the inability of the original self-attention to capture order information: by constraining information to flow only from front to back, the sequential-action relations are effectively constructed. The method thus solves the prior-art problem of considering only action co-occurrence while ignoring temporal order.
Those of ordinary skill in the art will appreciate that all or a portion of the steps in implementing the methods of the above embodiments may be implemented by a program to instruct related hardware, where the program may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc.
The foregoing disclosure is illustrative of the present invention and is not to be construed as limiting the scope of the invention, which is defined by the appended claims.
Claims (5)
1. A multi-label action recognition method based on simultaneous and sequential action relation modeling, the method comprising the steps of:
acquiring a video to be identified, and performing feature extraction based on a preset feature extraction network to obtain first feature maps for consecutive frames;
in a preset simultaneous-action relation modeling module, inputting the obtained first feature maps of the consecutive frames, modeling the relations of simultaneous actions based on the spatial information of each first feature map, and outputting a plurality of second feature maps with simultaneous-action relations; and
in a preset sequential-action relation modeling module, inputting the plurality of second feature maps with simultaneous-action relations, and classifying and predicting a final action result by fusing the plurality of second feature maps with a plurality of third feature maps with sequential-action relations obtained in an intermediate step.
2. The multi-label action recognition method based on simultaneous and sequential action relation modeling of claim 1, wherein the feature extraction network is an X3D-series feature extractor whose output first feature maps are expressed as X = f_backbone(V) ∈ R^(T×D×H×W); where f_backbone(·) denotes the feature extractor; X denotes the first feature maps; H, W and D denote the height, width and number of channels of each frame's first feature map; V = {v_1, v_2, ..., v_T} denotes the video to be identified; T denotes the total number of frames of the video; Y = {y_1, ..., y_C} denotes the action category labels of the video to be identified, C being the number of action classes in the dataset, with y_i = 1 indicating that action i is present in the video.
3. The multi-label action recognition method based on simultaneous and sequential action relation modeling of claim 2, wherein the simultaneous-action relation modeling module is a Transformer encoder consisting of a self-attention network and a feed-forward network; wherein,
the simultaneous-action relation modeling module obtains the plurality of second feature maps with simultaneous-action relations by performing the following steps:
adjusting the spatial layout of the first feature maps of the consecutive frames by the formula X_fsim = f_reshape(X, T×(h·w)×d); where f_reshape(·,·) denotes a resizing function that takes two inputs: the feature map to be resized and the target dimensions;
after the spatial adjustment of each frame's first feature map, capturing the spatial relations of simultaneous actions with the self-attention network of the Transformer encoder, and then mapping the dimensions back to the input size with the feed-forward network of the Transformer encoder, so as to obtain a plurality of second feature maps in which the simultaneous-action relations have been built;
readjusting the sizes of the plurality of second feature maps by X_simu = f_reshape(f_simu(X_fsim), T×d×h×w) and outputting them; where f_simu(·) denotes a standard Transformer encoder.
4. The multi-label action recognition method based on simultaneous and sequential action relationship modeling of claim 3, wherein said sequential action relation modeling module utilizes a causal self-attention mechanism to construct the sequential action relation; wherein,
the sequential action relation modeling module predicts the final action result by executing the following steps:
adding a learnable position embedding vector p to the plurality of resized second feature maps with the simultaneous action relation (X_simu + p), so as to retain the position information of each frame; wherein p represents the learnable position embedding vector;
performing size adjustment in the time dimension on the plurality of second feature maps with the added position embedding vectors, by the formula X_fseq = f_reshape(X_simu + p, (T/2)×2×d×h×w); wherein the pairwise grouping along the time dimension allows the sequential relation between the two frames' actions within each group to be captured;
constructing a mask matrix M, and realizing a causal self-attention mechanism through the formula X' = f_softmax((f_Q(X_fseq) f_K(X_fseq)^T)/√d_k + M) f_V(X_fseq), so that within each group information can only flow from the second feature map of the earlier frame to the second feature map of the later frame, obtaining a plurality of groups of second feature maps; wherein f_Q(·), f_K(·), f_V(·) represent three linear mapping functions; d_k represents a scaling factor; f_softmax(·) represents the softmax function;
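The masked-attention step can be sketched as below, again with identity stand-ins for the linear mappings f_Q, f_K, f_V; the upper-triangular -inf mask M is what restricts each group's information flow to the earlier-frame → later-frame direction:

```python
import numpy as np

def causal_attention(x):
    # x: (groups, 2, d) -- each group holds two consecutive frames.
    # Identity maps stand in for f_Q, f_K, f_V (illustration only).
    d_k = x.shape[-1]
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(d_k)
    M = np.triu(np.full((2, 2), -np.inf), k=1)  # mask matrix M
    scores = scores + M                          # earlier frame cannot see the later one
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn = attn / attn.sum(axis=-1, keepdims=True)
    return attn @ x

x = np.random.default_rng(2).standard_normal((3, 2, 8))  # 3 groups of 2 frames
out = causal_attention(x)
print(out.shape)                        # (3, 2, 8)
print(np.allclose(out[:, 0], x[:, 0]))  # True: earlier frame attends only to itself
```

Because of the mask, the first frame of each group is returned unchanged, while the second frame mixes in information from the first.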
aggregating each group of second feature maps, with a weighting hyper-parameter α, into an aggregated second feature map X̂, and repeatedly applying causal self-attention and feature aggregation until the time dimension is reduced to 1, thereby obtaining a plurality of third feature maps X'_seq with the sequential action relation; wherein α represents a hyper-parameter and X̂ represents the second feature map after aggregation;
performing feature fusion, through a residual connection, between the plurality of second feature maps with the simultaneous action relation and the plurality of third feature maps with the sequential action relation, using the formula X_seq = f_reshape(X'_seq + f_mean(X_fseq), d×h×w), and performing size adjustment; wherein f_mean(·) represents the mean function;
classifying the fused feature map by the formula ŷ = f_fc(f_gap(X_seq)), and predicting the final action result; wherein f_fc(·) represents a fully connected layer, f_gap(·) represents global pooling, and ŷ represents the prediction confidence.
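The classification head in this last step amounts to global pooling followed by a fully connected layer; a hypothetical numpy sketch with random weights and made-up sizes:

```python
import numpy as np

def f_gap(x):
    # global average pooling over the spatial dimensions h, w
    return x.mean(axis=(0, 1))

def f_fc(x, W, b):
    # fully connected layer mapping d features to C action confidences
    return x @ W + b

h, w, d, C = 7, 7, 32, 10
rng = np.random.default_rng(3)
X_seq = rng.standard_normal((h, w, d))   # fused feature map after size adjustment
W_fc = 0.01 * rng.standard_normal((d, C))
b_fc = np.zeros(C)
y_hat = f_fc(f_gap(X_seq), W_fc, b_fc)   # one prediction confidence per action
print(y_hat.shape)  # (10,)
```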
5. The multi-label action recognition method based on simultaneous and sequential action relationship modeling of claim 4, wherein the loss function of the prediction confidence ŷ is: L = −Σ_{i=1}^{C} [ y_i log σ(ŷ_i) + (1 − y_i) log(1 − σ(ŷ_i)) ]; wherein σ(·) represents the sigmoid function.
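Reading claim 5 as the standard multi-label binary cross-entropy (σ applied per class, summed over the C actions; this reading is an assumption, since the formula image is not reproduced in the text), a numpy sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def multilabel_bce(y_hat, y):
    # Per-class sigmoid + binary cross-entropy, summed over the C actions.
    p = sigmoid(y_hat)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1.0, 0.0, 1.0])        # ground-truth multi-label vector
y_hat = np.array([2.0, -1.0, 0.5])   # predicted confidences (logits)
loss = multilabel_bce(y_hat, y)
print(round(float(loss), 4))  # ~0.9143
```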
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311407105.4A CN117409479A (en) | 2023-10-26 | 2023-10-26 | Multi-label action recognition method based on simultaneous and sequential action relation modeling |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117409479A true CN117409479A (en) | 2024-01-16 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||