CN117409479A - Multi-label action recognition method based on simultaneous and sequential action relation modeling - Google Patents
- Publication number: CN117409479A
- Application number: CN202311407105.4A
- Authority: CN
- Language: China
- Prior art keywords: action, feature, simultaneous, relation, graphs
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06V40/20: Recognition of human movements or behaviour, e.g. gesture recognition
- G06N3/0455: Auto-encoder networks; encoder-decoder networks
- G06N3/0499: Feedforward networks
- G06N3/08: Learning methods for neural networks
- G06V10/25: Determination of region of interest (ROI) or volume of interest (VOI)
- G06V10/764: Recognition using classification, e.g. of video objects
- G06V10/7715: Feature extraction, e.g. by transforming the feature space
- G06V10/774: Generating sets of training patterns; bootstrap methods
- G06V10/806: Fusion of extracted features
- G06V10/82: Recognition using neural networks
- G06V20/41: Higher-level, semantic clustering, classification or understanding of video scenes
- G06V20/46: Extracting features or characteristics from the video content
Abstract
The invention provides a multi-label action recognition method based on simultaneous and sequential action relation modeling. The method comprises: acquiring a video to be recognized and performing feature extraction with a feature extraction network to obtain first feature maps for consecutive frames; in a simultaneous-action relation modeling module, inputting the first feature maps, modeling the relations among simultaneous actions based on the spatial information of each first feature map, and outputting a plurality of second feature maps carrying the simultaneous-action relations; and, in a sequential-action relation modeling module, inputting the plurality of second feature maps, and classifying and predicting the final action result by fusing the second feature maps with a plurality of third feature maps carrying sequential-action relations obtained in an intermediate step. The method solves the problem in the prior art that only action co-occurrence is considered while the temporal order of actions is ignored.
Description
Technical Field
The invention relates to the technical field of image processing, in particular to a multi-label action recognition method based on simultaneous and sequential action relation modeling.
Background
With the rapid development of computer vision, video action recognition based on deep convolutional neural networks has attracted increasing attention from researchers, and action recognition algorithms are gradually being applied in fields such as video surveillance. Although great breakthroughs have been achieved in the accuracy of single-action recognition, in real-world scenes the captured video often contains multiple actions.
In recent years, researchers have gradually shifted their focus to multi-label action videos, with the goal of identifying all actions in a video. However, since different actions in a video may be performed by different objects and may occur at different times, multi-label action recognition poses a significant challenge.
Currently, the mainstream approach to the multi-label action recognition task is to split the multi-label problem into several single-label tasks for classification. Although these methods can bring some performance improvement to multi-label action recognition thanks to improvements in the feature extractor, they do not consider the relations between actions, which limits performance. To address this, researchers have focused on establishing relations between actions to reduce the search space and thereby improve recognition performance; the main method borrows from multi-label image recognition, building a graph structure from action co-occurrence and then capturing action relations with that graph. Such methods, while intuitive, ignore that actions have a temporal order that cannot be arbitrarily permuted, so they cannot construct accurate action relations.
Therefore, a new multi-label action recognition method is needed to solve the prior-art problem of considering only action co-occurrence while ignoring temporal order.
Disclosure of Invention
The technical problem to be solved by the embodiments of the invention is to provide a multi-label action recognition method based on simultaneous and sequential action relation modeling, which solves the problem in the prior art that only action co-occurrence is considered while the temporal order of actions is ignored.
To solve this technical problem, an embodiment of the invention provides a multi-label action recognition method based on simultaneous and sequential action relation modeling, comprising the following steps:
acquiring a video to be identified, and performing feature extraction based on a preset feature extraction network to obtain first feature maps for consecutive frames;
in a preset simultaneous-action relation modeling module, inputting the obtained first feature maps of the consecutive frames, modeling the relations of simultaneous actions based on the spatial information of each first feature map, and outputting a plurality of second feature maps with simultaneous-action relations; and
in a preset sequential-action relation modeling module, inputting the plurality of second feature maps with simultaneous-action relations, and classifying and predicting a final action result by fusing the plurality of second feature maps with a plurality of third feature maps with sequential-action relations obtained in an intermediate step.
The feature extraction network is an X3D-series feature extractor, and the first feature maps it outputs are expressed as X = f_backbone(V) ∈ R^(T×D×H×W); where f_backbone(·) denotes the feature extractor; X denotes the first feature maps; H, W and D denote the height, width and number of channels of each frame's first feature map; V = {v_1, v_2, ..., v_T} denotes the video to be identified; T denotes the total number of frames of the video; Y = {y_1, ..., y_C} denotes the action category labels of the video to be identified, C being the number of action classes in the dataset, with y_i = 1 indicating that action i is present in the video.
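As a rough sketch of this extraction step's input/output contract, the stand-in below reproduces only the shape mapping of f_backbone; the real X3D network is not reproduced, and the tensor sizes (T=8 frames, d=64 channels, 7×7 spatial grid) are illustrative assumptions:

```python
import numpy as np

def backbone_stub(video, d=64, h=7, w=7):
    """Stand-in for the X3D feature extractor f_backbone: maps a video
    of T frames to per-frame first feature maps of shape (T, d, h, w).
    A real run would load pretrained X3D weights instead."""
    t = video.shape[0]
    rng = np.random.default_rng(0)
    return rng.standard_normal((t, d, h, w))

video = np.zeros((8, 3, 224, 224))  # T=8 RGB frames (toy input)
X = backbone_stub(video)
assert X.shape == (8, 64, 7, 7)     # one d x h x w feature map per frame
```

Every later module in the method operates on tensors of this (T, d, h, w) shape.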
The simultaneous-action relation modeling module is a Transformer encoder, consisting of a self-attention network and a feed-forward network; wherein,
the simultaneous-action relation modeling module obtains the plurality of second feature maps with simultaneous-action relations by performing the following steps:
adjusting the spatial layout of the first feature maps of the consecutive frames by the formula X_fsim = f_reshape(X, T×(h·w)×d); where f_reshape(·,·) denotes a resizing function that takes two inputs: the feature map to be resized and the target dimensions;
after the spatial adjustment of each frame's first feature map, capturing the spatial relations of simultaneous actions with the self-attention network of the Transformer encoder, and then mapping the dimensions back to the input size with the feed-forward network of the Transformer encoder, so as to obtain a plurality of second feature maps in which the simultaneous-action relations have been built;
readjusting the sizes of the plurality of second feature maps by X_simu = f_reshape(f_simu(X_fsim), T×d×h×w) and outputting them; where f_simu(·) denotes a standard Transformer encoder.
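The reshape-encode-reshape round trip of this module can be sketched as follows. The token layout (one token per spatial position, feature dimension d) and the identity stand-in for the Transformer encoder f_simu are assumptions for illustration, not the patented network:

```python
import numpy as np

def f_reshape(x, shape):
    """The resizing function from the method: reshape to target dims."""
    return x.reshape(shape)

def identity_encoder(tokens):
    """Placeholder for the standard Transformer encoder f_simu
    (self-attention + feed-forward); here simply the identity."""
    return tokens

def simultaneous_module(X):
    """Sketch of the simultaneous-action module: flatten each frame's
    h x w spatial grid into h*w tokens of dimension d, run the
    (stand-in) encoder over them, then reshape back to (T, d, h, w)."""
    T, d, h, w = X.shape
    X_fsim = f_reshape(X.transpose(0, 2, 3, 1), (T, h * w, d))
    encoded = identity_encoder(X_fsim)
    return f_reshape(encoded, (T, h, w, d)).transpose(0, 3, 1, 2)

X = np.arange(2 * 4 * 3 * 3, dtype=float).reshape(2, 4, 3, 3)
X_simu = simultaneous_module(X)
assert X_simu.shape == X.shape
assert np.allclose(X_simu, X)  # identity encoder: exact round trip
```

With a real encoder in place of the identity, each spatial token would aggregate information from every other position in the frame, which is what lets co-occurring actions reinforce each other.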
The sequence action relation modeling module utilizes a causal self-attention mechanism to construct an action sequence relation; wherein,
the sequence action relation modeling module predicts a final action result by executing the following steps:
adding a position embedding vector to the resized second feature maps by the formula X'_simu = X_simu + p, so as to retain the position information of each frame; where p denotes a learnable position embedding vector;
resizing the second feature maps with the added position embeddings along the time dimension by the formula X_fseq = f_reshape(X'_simu, (T/2)×2×d×h×w); where the two-by-two grouping along the time dimension lets the sequence relation of the two frames' actions within each group be captured;
constructing a mask matrix M whose entries above the main diagonal are −∞ and whose remaining entries are 0, and realizing a causal self-attention mechanism by the formula f_attn(X) = f_softmax((f_Q(X)·f_K(X)^T)/√(d_k) + M)·f_V(X), so that within each group information can only flow from the second feature map of the previous frame to the second feature map of the next frame, obtaining several groups of second feature maps; where f_Q(·), f_K(·), f_V(·) denote three linear mapping functions, d_k denotes a scaling factor, and f_softmax(·) denotes the softmax function;
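A minimal numeric illustration of the causal self-attention step, with the mask matrix built as −∞ above the diagonal; the single-head form and the weight shapes are simplifying assumptions:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax along the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def causal_self_attention(X, Wq, Wk, Wv):
    """Causal self-attention over a sequence of n frame features X (n, d).
    The mask M puts -inf above the diagonal, so position i attends only
    to positions <= i: information flows from front to back only."""
    n = X.shape[0]
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]                            # scaling factor
    M = np.triu(np.full((n, n), -np.inf), k=1)   # mask matrix
    A = softmax(Q @ K.T / np.sqrt(d_k) + M)      # attention weights
    return A @ V, A

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))
W = [rng.standard_normal((8, 8)) for _ in range(3)]
out, A = causal_self_attention(X, *W)
assert out.shape == (4, 8)
assert np.allclose(np.triu(A, k=1), 0.0)  # no attention to future frames
```

The zeroed upper triangle of A is the numerical realization of the constraint that each group's information can only flow from the earlier frame to the later one.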
aggregating each group of second feature maps by the formula X̂ = X^(1) + α·X^(2), where X^(1) and X^(2) are the second feature maps of the earlier and later frame in a group, and repeating the causal self-attention and feature aggregation until the time dimension is reduced to 1, so as to obtain the plurality of third feature maps X'_seq with sequential-action relations; where α denotes a hyperparameter and X̂ denotes a second feature map after aggregation;
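The pair-wise grouping that halves the time dimension until it reaches 1 can be sketched as below. The weighted-sum form X1 + alpha·X2 of the aggregation is an assumption (the exact formula was lost in extraction), and the causal self-attention applied inside each group is elided:

```python
import numpy as np

def aggregate_pairs(X, alpha=0.5):
    """Collapse consecutive frame pairs (X1, X2) along time into one
    feature map each, via the assumed weighted sum X1 + alpha * X2."""
    return X[0::2] + alpha * X[1::2]

def reduce_to_one(X, alpha=0.5):
    """Repeat grouping + aggregation until the time dimension is 1.
    (A causal self-attention step over each pair would precede the
    aggregation in the full method.) Assumes T is a power of two."""
    while X.shape[0] > 1:
        X = aggregate_pairs(X, alpha)
    return X[0]

X = np.ones((8, 4, 2, 2))          # T=8 toy second feature maps
out = reduce_to_one(X, alpha=1.0)  # 8 -> 4 -> 2 -> 1 frames
assert out.shape == (4, 2, 2)
assert np.allclose(out, 8.0)       # all-ones summed pairwise three times
```

Each halving step widens the temporal span covered by one feature map, so the final map summarizes the whole ordered sequence.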
performing feature fusion between the plurality of second feature maps with simultaneous-action relations and the plurality of third feature maps with sequential-action relations through a residual connection, and resizing by the formula X_seq = f_reshape(X'_seq + f_mean(X_fseq), d×h×w); where f_mean(·) denotes the mean function;
classifying the fused feature map and predicting the final action result by the formula ŷ = f_fc(f_gap(X_seq)); where f_fc(·) denotes a fully connected layer, f_gap(·) denotes global average pooling, and ŷ denotes the prediction confidence.
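The classification head, global average pooling f_gap followed by the fully connected layer f_fc, can be illustrated with plain arrays; the dimensions d=16, h=w=7 and C=5 action classes are arbitrary toy values:

```python
import numpy as np

def f_gap(X):
    """Global average pooling over the spatial h, w axes of (d, h, w)."""
    return X.mean(axis=(1, 2))

def f_fc(z, W, b):
    """Fully connected layer mapping the pooled d-vector to C logits."""
    return z @ W + b

d, h, w, C = 16, 7, 7, 5
rng = np.random.default_rng(0)
X_seq = rng.standard_normal((d, h, w))  # fused feature map
W, b = rng.standard_normal((d, C)), np.zeros(C)
y_hat = f_fc(f_gap(X_seq), W, b)        # one confidence score per action
assert y_hat.shape == (C,)
```

The C logits are turned into per-action probabilities by the sigmoid in the loss below, one independent decision per label rather than a single softmax over classes.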
The loss function for the prediction confidence ŷ is the multi-label binary cross-entropy
L = −∑_{i=1}^{C} [ y_i·log σ(ŷ_i) + (1−y_i)·log(1−σ(ŷ_i)) ],
where σ(·) denotes the sigmoid function.
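Under the stated form of the loss, a small sanity check that confident correct predictions give near-zero loss while confident wrong ones do not; averaging over classes rather than summing is an assumption here:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def multilabel_bce(y_hat, y):
    """Multi-label binary cross-entropy over C action classes:
    each class gets an independent sigmoid + log-loss term."""
    p = sigmoid(y_hat)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1.0, 0.0, 1.0])                       # actions 0 and 2 present
perfect = multilabel_bce(np.array([10.0, -10.0, 10.0]), y)
bad = multilabel_bce(np.array([-10.0, 10.0, -10.0]), y)
assert perfect < 1e-3 < bad   # near-zero loss only when labels match
```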
The embodiment of the invention has the following beneficial effects:
The invention not only provides a simultaneous-action relation modeling module that effectively constructs the simultaneous-action relations, but also provides a sequential-action relation modeling module that remedies the inability of the original self-attention to capture order information: by constraining information to flow only from front to back, the sequential-action relations are effectively constructed. The method thus solves the prior-art problem of considering only action co-occurrence while ignoring temporal order.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions of the prior art, the drawings which are required in the description of the embodiments or the prior art will be briefly described, it being obvious that the drawings in the description below are only some embodiments of the invention, and that it is within the scope of the invention to one skilled in the art to obtain other drawings from these drawings without inventive faculty.
FIG. 1 is a flowchart of a multi-tag action recognition method based on simultaneous and sequential action relationship modeling provided by an embodiment of the present invention;
FIG. 2 is a logic diagram of a simultaneous action relation modeling module in a multi-label action recognition method based on simultaneous and sequential action relation modeling according to an embodiment of the present invention;
FIG. 3 is a visual result diagram of capturing a simultaneous relationship using the action simultaneous relationship modeling module of FIG. 2;
FIG. 4 is a logic diagram of a sequential action relation modeling module in a multi-label action recognition method based on simultaneous and sequential action relation modeling according to an embodiment of the present invention;
FIG. 5 is a graph of a visual result of capturing a sequence relationship using the sequence action relationship modeling module of FIG. 4;
fig. 6 is an overall logic diagram of a multi-tag action recognition method based on simultaneous and sequential action relationship modeling according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings, for the purpose of making the objects, technical solutions and advantages of the present invention more apparent.
As shown in fig. 1, in an embodiment of the present invention, a multi-tag action recognition method based on simultaneous and sequential action relationship modeling is provided, where the method includes the following steps:
Step S1, acquiring a video to be identified, and performing feature extraction based on a preset feature extraction network to obtain first feature maps for consecutive frames.
Firstly, the video to be identified is acquired.
Secondly, feature extraction is performed on the video to be identified by the predefined feature extraction network to obtain the first feature maps for the consecutive frames. The feature extraction network is an X3D-series feature extractor, whose output first feature maps are expressed as X = f_backbone(V) ∈ R^(T×D×H×W); where f_backbone(·) denotes the feature extractor; X denotes the first feature maps; H, W and D denote the height, width and number of channels of each frame's first feature map; V = {v_1, v_2, ..., v_T} denotes the video to be identified; T denotes the total number of frames of the video; Y = {y_1, ..., y_C} denotes the action category labels of the video, C being the number of action classes in the dataset, with y_i = 1 indicating that action i is present in the video.
Step S2, in the preset simultaneous-action relation modeling module, inputting the obtained first feature maps of the consecutive frames, modeling the relations of simultaneous actions based on the spatial information of each first feature map, and outputting a plurality of second feature maps with simultaneous-action relations.
Firstly, the simultaneous-action relation modeling module is predefined; the module is a Transformer encoder consisting of a self-attention network and a feed-forward network, as shown in fig. 2.
Secondly, the first feature maps of the consecutive frames are fed into the simultaneous-action relation modeling module, the relations of simultaneous actions are modeled based on the spatial information of each first feature map, and a plurality of second feature maps with simultaneous-action relations are output, specifically as follows:
adjusting the spatial layout of the first feature maps of the consecutive frames by the formula X_fsim = f_reshape(X, T×(h·w)×d); where f_reshape(·,·) denotes a resizing function that takes two inputs: the feature map to be resized and the target dimensions;
after the spatial adjustment of each frame's first feature map, capturing the spatial relations of simultaneous actions with the self-attention network of the Transformer encoder, and then mapping the dimensions back to the input size with the feed-forward network of the Transformer encoder, so as to obtain a plurality of second feature maps in which the simultaneous-action relations have been built;
readjusting the sizes of the plurality of second feature maps by X_simu = f_reshape(f_simu(X_fsim), T×d×h×w) and outputting them; where f_simu(·) denotes a standard Transformer encoder.
In one example, fig. 3 shows the visualization of the captured simultaneous relations: co-occurring actions that were not activated before become activated once the simultaneous relations are captured.
Step S3, inputting the obtained second feature maps with simultaneous-action relations into the preset sequential-action relation modeling module, and classifying and predicting the final action result by fusing the input second feature maps with the third feature maps with sequential-action relations obtained in the intermediate step.
Specifically, first, the predefined sequential-action relation modeling module constructs the action sequence relations with a causal self-attention mechanism, capturing the order of actions at the temporal level, as shown in fig. 4.
Secondly, the second feature maps with simultaneous-action relations are fed into the sequential-action relation modeling module for feature extraction, fusion and classification, and the multi-label action classes are predicted, specifically as follows:
adding a position embedding vector to the resized second feature maps by the formula X'_simu = X_simu + p, so as to retain the position information of each frame; where p denotes a learnable position embedding vector;
resizing the second feature maps with the added position embeddings along the time dimension by the formula X_fseq = f_reshape(X'_simu, (T/2)×2×d×h×w); where the two-by-two grouping along the time dimension lets the sequence relation of the two frames' actions within each group be captured;
constructing a mask matrix M whose entries above the main diagonal are −∞ and whose remaining entries are 0, and realizing a causal self-attention mechanism by the formula f_attn(X) = f_softmax((f_Q(X)·f_K(X)^T)/√(d_k) + M)·f_V(X), so that within each group information can only flow from the second feature map of the previous frame to the second feature map of the next frame, obtaining several groups of second feature maps; where f_Q(·), f_K(·), f_V(·) denote three linear mapping functions, d_k denotes a scaling factor, and f_softmax(·) denotes the softmax function;
aggregating each group of second feature maps by the formula X̂ = X^(1) + α·X^(2), where X^(1) and X^(2) are the second feature maps of the earlier and later frame in a group, and repeating the causal self-attention and feature aggregation until the time dimension is reduced to 1, so as to obtain the plurality of third feature maps X'_seq with sequential-action relations; where α denotes a hyperparameter and X̂ denotes a second feature map after aggregation;
performing feature fusion between the second feature maps with simultaneous-action relations and the third feature maps with sequential-action relations through a residual connection, and resizing by the formula X_seq = f_reshape(X'_seq + f_mean(X_fseq), d×h×w); where f_mean(·) denotes the mean function;
classifying the fused feature map and predicting the final action result by the formula ŷ = f_fc(f_gap(X_seq)); where f_fc(·) denotes a fully connected layer, f_gap(·) denotes global average pooling, and ŷ denotes the prediction confidence.
It should be noted that the loss function for the prediction confidence ŷ is the multi-label binary cross-entropy
L = −∑_{i=1}^{C} [ y_i·log σ(ŷ_i) + (1−y_i)·log(1−σ(ŷ_i)) ],
where σ(·) denotes the sigmoid function.
In one example, fig. 5 shows the visualization of the captured sequence relations: actions that occur one after another and were not activated before become activated after the sequence relations are captured.
At this time, an overall logic diagram of a multi-tag action recognition method based on simultaneous and sequential action relationship modeling in the embodiment of the present invention is shown in fig. 6.
The embodiment of the invention has the following beneficial effects:
The invention not only provides a simultaneous-action relation modeling module that effectively constructs the simultaneous-action relations, but also provides a sequential-action relation modeling module that remedies the inability of the original self-attention to capture order information: by constraining information to flow only from front to back, the sequential-action relations are effectively constructed. The method thus solves the prior-art problem of considering only action co-occurrence while ignoring temporal order.
Those of ordinary skill in the art will appreciate that all or a portion of the steps in implementing the methods of the above embodiments may be implemented by a program to instruct related hardware, where the program may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc.
The foregoing disclosure is illustrative of the present invention and is not to be construed as limiting the scope of the invention, which is defined by the appended claims.
Claims (5)
1. A multi-label action recognition method based on simultaneous and sequential action relation modeling, the method comprising the steps of:
acquiring a video to be identified, and performing feature extraction based on a preset feature extraction network to obtain first feature maps for consecutive frames;
in a preset simultaneous-action relation modeling module, inputting the obtained first feature maps of the consecutive frames, modeling the relations of simultaneous actions based on the spatial information of each first feature map, and outputting a plurality of second feature maps with simultaneous-action relations; and
in a preset sequential-action relation modeling module, inputting the plurality of second feature maps with simultaneous-action relations, and classifying and predicting a final action result by fusing the plurality of second feature maps with a plurality of third feature maps with sequential-action relations obtained in an intermediate step.
2. The multi-label action recognition method based on simultaneous and sequential action relation modeling of claim 1, wherein the feature extraction network is an X3D-series feature extractor whose output first feature maps are expressed as X = f_backbone(V) ∈ R^(T×D×H×W); where f_backbone(·) denotes the feature extractor; X denotes the first feature maps; H, W and D denote the height, width and number of channels of each frame's first feature map; V = {v_1, v_2, ..., v_T} denotes the video to be identified; T denotes the total number of frames of the video; Y = {y_1, ..., y_C} denotes the action category labels of the video to be identified, C being the number of action classes in the dataset, with y_i = 1 indicating that action i is present in the video.
3. The multi-label action recognition method based on simultaneous and sequential action relation modeling of claim 2, wherein the simultaneous-action relation modeling module is a Transformer encoder consisting of a self-attention network and a feed-forward network; wherein,
the simultaneous-action relation modeling module obtains the plurality of second feature maps with simultaneous-action relations by performing the following steps:
adjusting the spatial layout of the first feature maps of the consecutive frames by the formula X_fsim = f_reshape(X, T×(h·w)×d); where f_reshape(·,·) denotes a resizing function that takes two inputs: the feature map to be resized and the target dimensions;
after the spatial adjustment of each frame's first feature map, capturing the spatial relations of simultaneous actions with the self-attention network of the Transformer encoder, and then mapping the dimensions back to the input size with the feed-forward network of the Transformer encoder, so as to obtain a plurality of second feature maps in which the simultaneous-action relations have been built;
readjusting the sizes of the plurality of second feature maps by X_simu = f_reshape(f_simu(X_fsim), T×d×h×w) and outputting them; where f_simu(·) denotes a standard Transformer encoder.
4. The multi-label action recognition method based on simultaneous and sequential action relationship modeling of claim 3, wherein said sequential action relation modeling module utilizes a causal self-attention mechanism to construct the sequential action relation; wherein,
the sequential action relation modeling module predicts the final action result by executing the following steps:
adding a learnable position embedding vector p to the plurality of resized second feature maps with the simultaneous action relation (X_simu + p), so as to retain the position information of each frame; wherein p represents the learnable position embedding vector;
performing size adjustment in the time dimension on the plurality of second feature maps with the added position embedding vectors, by the formula X_fseq = f_reshape(X_simu + p, (T/2)×2×d×h×w); wherein the pairwise grouping along the time dimension allows the sequential relation between the two frames' actions within each group to be captured;
constructing a mask matrix M, and realizing a causal self-attention mechanism through the formula X' = f_softmax((f_Q(X_fseq) f_K(X_fseq)^T)/√d_k + M) f_V(X_fseq), so that within each group information can only flow from the second feature map of the earlier frame to the second feature map of the later frame, obtaining a plurality of groups of second feature maps; wherein f_Q(·), f_K(·), f_V(·) represent three linear mapping functions; d_k represents a scaling factor; f_softmax(·) represents the softmax function;
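The masked-attention step can be sketched as below, again with identity stand-ins for the linear mappings f_Q, f_K, f_V; the upper-triangular -inf mask M is what restricts each group's information flow to the earlier-frame → later-frame direction:

```python
import numpy as np

def causal_attention(x):
    # x: (groups, 2, d) -- each group holds two consecutive frames.
    # Identity maps stand in for f_Q, f_K, f_V (illustration only).
    d_k = x.shape[-1]
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(d_k)
    M = np.triu(np.full((2, 2), -np.inf), k=1)  # mask matrix M
    scores = scores + M                          # earlier frame cannot see the later one
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn = attn / attn.sum(axis=-1, keepdims=True)
    return attn @ x

x = np.random.default_rng(2).standard_normal((3, 2, 8))  # 3 groups of 2 frames
out = causal_attention(x)
print(out.shape)                        # (3, 2, 8)
print(np.allclose(out[:, 0], x[:, 0]))  # True: earlier frame attends only to itself
```

Because of the mask, the first frame of each group is returned unchanged, while the second frame mixes in information from the first.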
aggregating each group of second feature maps, with a weighting hyper-parameter α, into an aggregated second feature map X̂, and repeatedly applying causal self-attention and feature aggregation until the time dimension is reduced to 1, thereby obtaining a plurality of third feature maps X'_seq with the sequential action relation; wherein α represents a hyper-parameter and X̂ represents the second feature map after aggregation;
performing feature fusion, through a residual connection, between the plurality of second feature maps with the simultaneous action relation and the plurality of third feature maps with the sequential action relation, using the formula X_seq = f_reshape(X'_seq + f_mean(X_fseq), d×h×w), and performing size adjustment; wherein f_mean(·) represents the mean function;
classifying the fused feature map by the formula ŷ = f_fc(f_gap(X_seq)), and predicting the final action result; wherein f_fc(·) represents a fully connected layer, f_gap(·) represents global pooling, and ŷ represents the prediction confidence.
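The classification head in this last step amounts to global pooling followed by a fully connected layer; a hypothetical numpy sketch with random weights and made-up sizes:

```python
import numpy as np

def f_gap(x):
    # global average pooling over the spatial dimensions h, w
    return x.mean(axis=(0, 1))

def f_fc(x, W, b):
    # fully connected layer mapping d features to C action confidences
    return x @ W + b

h, w, d, C = 7, 7, 32, 10
rng = np.random.default_rng(3)
X_seq = rng.standard_normal((h, w, d))   # fused feature map after size adjustment
W_fc = 0.01 * rng.standard_normal((d, C))
b_fc = np.zeros(C)
y_hat = f_fc(f_gap(X_seq), W_fc, b_fc)   # one prediction confidence per action
print(y_hat.shape)  # (10,)
```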
5. The multi-label action recognition method based on simultaneous and sequential action relationship modeling of claim 4, wherein the loss function of the prediction confidence ŷ is: L = −Σ_{i=1}^{C} [ y_i log σ(ŷ_i) + (1 − y_i) log(1 − σ(ŷ_i)) ]; wherein σ(·) represents the sigmoid function.
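Reading claim 5 as the standard multi-label binary cross-entropy (σ applied per class, summed over the C actions; this reading is an assumption, since the formula image is not reproduced in the text), a numpy sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def multilabel_bce(y_hat, y):
    # Per-class sigmoid + binary cross-entropy, summed over the C actions.
    p = sigmoid(y_hat)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1.0, 0.0, 1.0])        # ground-truth multi-label vector
y_hat = np.array([2.0, -1.0, 0.5])   # predicted confidences (logits)
loss = multilabel_bce(y_hat, y)
print(round(float(loss), 4))  # ~0.9143
```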
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311407105.4A CN117409479A (en) | 2023-10-26 | 2023-10-26 | Multi-label action recognition method based on simultaneous and sequential action relation modeling |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117409479A true CN117409479A (en) | 2024-01-16 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||