CN117409479A - Multi-label action recognition method based on simultaneous and sequential action relation modeling - Google Patents

Multi-label action recognition method based on simultaneous and sequential action relation modeling

Info

Publication number
CN117409479A
Authority
CN
China
Prior art keywords
action
feature
simultaneous
relation
graphs
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311407105.4A
Other languages
Chinese (zh)
Inventor
陈钊民
张笑钦
夏超群
葛一粟
章国道
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wenzhou University
Original Assignee
Wenzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wenzhou University
Priority to CN202311407105.4A
Publication of CN117409479A
Legal status: Pending

Classifications

    • G06V 40/20: Recognition of movements or behaviour, e.g. gesture recognition
    • G06N 3/0455: Auto-encoder networks; encoder-decoder networks
    • G06N 3/0499: Feedforward networks
    • G06N 3/08: Learning methods
    • G06V 10/25: Determination of region of interest [ROI] or a volume of interest [VOI]
    • G06V 10/764: Recognition using pattern recognition or machine learning, using classification, e.g. of video objects
    • G06V 10/7715: Feature extraction, e.g. by transforming the feature space
    • G06V 10/774: Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V 10/806: Fusion of extracted features
    • G06V 10/82: Recognition using pattern recognition or machine learning, using neural networks
    • G06V 20/41: Higher-level, semantic clustering, classification or understanding of video scenes
    • G06V 20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Human Computer Interaction (AREA)
  • Social Psychology (AREA)
  • Psychiatry (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a multi-label action recognition method based on simultaneous and sequential action relation modeling. The method comprises: obtaining a video to be recognized and performing feature extraction with a feature extraction network to obtain first feature maps of consecutive frames; in a simultaneous action relation modeling module, taking the first feature maps as input, modeling the relations between simultaneous actions based on the spatial information of each first feature map, and outputting a plurality of second feature maps carrying the simultaneous action relations; and in a sequential action relation modeling module, taking the plurality of second feature maps carrying the simultaneous action relations as input, and classifying and predicting the final action result by fusing these second feature maps with a plurality of third feature maps carrying the sequential action relations obtained in the intermediate process. By implementing the method, the problem in the prior art that only the action co-occurrence relation is considered while the temporal relation is ignored can be solved.

Description

Multi-label action recognition method based on simultaneous and sequential action relation modeling
Technical Field
The invention relates to the technical field of image processing, in particular to a multi-label action recognition method based on simultaneous and sequential action relation modeling.
Background
With the rapid development of computer vision, video action recognition based on deep convolutional neural networks has attracted increasing attention from researchers, and action recognition algorithms are gradually being applied to fields such as video surveillance. Although great breakthroughs have been achieved in the accuracy of single-action recognition, in real-world scenes the captured video often contains multiple actions.
In recent years, researchers have gradually shifted their focus to multi-label action videos, with the goal of recognizing all actions in a video. However, since different actions in a video may be performed by different objects and may occur at different times, multi-label action recognition poses a significant challenge.
Currently, the mainstream research direction for the multi-label action recognition task is to split the multi-label task into multiple single-label tasks for classification and recognition. Although such methods can bring some performance gains for multi-label action recognition owing to improvements in the feature extractor, they do not consider the relations between actions, which limits their performance. To address this problem, researchers have turned to establishing relations between actions so as to reduce the search space and thereby improve recognition performance; the main approach borrows from multi-label image recognition, building a graph structure from the co-occurrence of actions and then capturing action relations with that graph. Such methods, although intuitive, ignore the fact that actions have a temporal order that cannot be arbitrarily shuffled, and therefore cannot construct accurate action relations.
Therefore, a new multi-label action recognition method is needed to solve the problem in the prior art that only the action co-occurrence relation is considered while the temporal relation is ignored.
Disclosure of Invention
The technical problem to be solved by the embodiments of the invention is to provide a multi-label action recognition method based on simultaneous and sequential action relation modeling, which can solve the problem in the prior art that only the action co-occurrence relation is considered while the temporal relation is ignored.
To solve this technical problem, an embodiment of the invention provides a multi-label action recognition method based on simultaneous and sequential action relation modeling, comprising the following steps:
acquiring a video to be recognized, and performing feature extraction with a preset feature extraction network to obtain first feature maps of consecutive frames;
in a preset simultaneous action relation modeling module, taking the obtained first feature maps of the consecutive frames as input, modeling the relations between simultaneous actions based on the spatial information of each first feature map, and outputting a plurality of second feature maps carrying the simultaneous action relations;
and in a preset sequential action relation modeling module, taking the plurality of second feature maps carrying the simultaneous action relations as input, and classifying and predicting the final action result by fusing the plurality of second feature maps carrying the simultaneous action relations with a plurality of third feature maps carrying the sequential action relations obtained in the intermediate process.
The feature extraction network is an X3D-series feature extractor, and the first feature maps it outputs can be represented as X = f_backbone(V) ∈ R^(T×H×W×D); wherein
f_backbone(·) denotes the feature extractor; x denotes the first feature map of each frame; H, W and D denote the height, width and number of channels of each frame's first feature map, respectively; V = {v_1, v_2, ..., v_T} denotes the video to be recognized; T denotes the total number of frames of the video; Y = {y_1, ..., y_C} denotes the action category labels of the video to be recognized, C is the number of action categories in the dataset, and y_i = 1 indicates that action i is present in the video.
The simultaneous action relation modeling module is a Transformer encoder composed of a self-attention network and a feed-forward network; wherein
the simultaneous action relation modeling module obtains the plurality of second feature maps carrying the simultaneous action relations by executing the following steps:
spatially rearranging the first feature maps of the obtained consecutive frames by X_fsim = f_reshape(X, (T, h×w, d)); wherein f_reshape(·,·) denotes a resizing function that accepts two inputs, the feature map to be resized and the target dimensions;
after the spatial rearrangement of each frame's first feature map, capturing the spatial relations between simultaneous actions with the self-attention network of the Transformer encoder, and then mapping the dimensions back to the input size with the feed-forward network of the Transformer encoder, thereby obtaining a plurality of second feature maps in which the simultaneous action relations have been built;
readjusting the sizes of the plurality of second feature maps with the built simultaneous action relations by X_simu = f_reshape(f_simu(X_fsim), (T, d, h, w)) and outputting them; wherein f_simu(·) denotes a standard Transformer encoder.
The sequential action relation modeling module constructs the sequential relations between actions with a causal self-attention mechanism; wherein
the sequential action relation modeling module predicts the final action result by executing the following steps:
adding a position embedding vector to the resized plurality of second feature maps carrying the simultaneous action relations, so as to retain the position information of each frame; wherein p denotes a learnable position embedding vector;
resizing the plurality of second feature maps with the added position embedding vectors along the temporal dimension by grouping the frames in pairs, so that the sequential relation between the two frames' actions within each group can be captured;
constructing a mask matrix M and realizing a causal self-attention mechanism through f_softmax((f_Q(X) f_K(X)^T)/√d_k + M) f_V(X), so that within each group information can only flow from the second feature map of the earlier frame to the second feature map of the later frame, thereby obtaining a plurality of groups of second feature maps; wherein f_Q(·), f_K(·), f_V(·) denote three linear mapping functions, d_k denotes a scaling factor, and f_softmax(·) denotes the softmax function;
aggregating each group of second feature maps, and repeating causal self-attention and feature aggregation until the temporal dimension is reduced to 1, thereby obtaining a plurality of third feature maps X'_seq carrying the sequential action relations; wherein α denotes a hyperparameter and X̂ denotes an aggregated second feature map;
fusing the plurality of second feature maps carrying the simultaneous action relations with the plurality of third feature maps carrying the sequential action relations through a residual connection, and resizing the result by X_seq = f_reshape(X'_seq + f_mean(X_fseq), (d, h, w)); wherein f_mean(·) denotes the mean function;
classifying the fused feature maps by ŷ = f_fc(f_gap(X_seq)) and predicting the final action result; wherein f_fc(·) denotes a fully connected layer, f_gap(·) denotes global pooling, and ŷ denotes the prediction confidence.
The loss function for the prediction confidence ŷ is a multi-label classification loss computed from ŷ and the ground-truth action labels y, in which σ(·) denotes the sigmoid function.
The embodiment of the invention has the following beneficial effects:
The invention not only provides a simultaneous action relation modeling module that effectively constructs the simultaneous relations between actions, but also provides a sequential action relation modeling module that remedies the inability of ordinary self-attention to capture order information by constraining information to flow only from earlier frames to later frames, thereby effectively constructing the sequential relations between actions. In this way, the problem in the prior art that only the action co-occurrence relation is considered while the temporal relation is ignored can be solved.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions of the prior art, the drawings required in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings described below are only some embodiments of the invention, and that other drawings can be obtained from them by those skilled in the art without inventive effort; such drawings also fall within the scope of the invention.
FIG. 1 is a flowchart of a multi-label action recognition method based on simultaneous and sequential action relation modeling provided by an embodiment of the present invention;
FIG. 2 is a logic diagram of the simultaneous action relation modeling module in the multi-label action recognition method based on simultaneous and sequential action relation modeling provided by an embodiment of the present invention;
FIG. 3 is a visualization of simultaneous relations captured by the simultaneous action relation modeling module of FIG. 2;
FIG. 4 is a logic diagram of the sequential action relation modeling module in the multi-label action recognition method based on simultaneous and sequential action relation modeling provided by an embodiment of the present invention;
FIG. 5 is a visualization of sequential relations captured by the sequential action relation modeling module of FIG. 4;
FIG. 6 is an overall logic diagram of the multi-label action recognition method based on simultaneous and sequential action relation modeling provided by an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings, for the purpose of making the objects, technical solutions and advantages of the present invention more apparent.
As shown in FIG. 1, an embodiment of the present invention provides a multi-label action recognition method based on simultaneous and sequential action relation modeling, which comprises the following steps:
step S1, acquiring a video to be identified, and carrying out feature extraction based on a preset feature extraction network to obtain a first feature map of a continuous multi-frame;
firstly, acquiring a video to be identified.
Secondly, feature extraction is performed on the video to be recognized by a predefined feature extraction network to obtain the first feature maps of consecutive frames. The feature extraction network is an X3D-series feature extractor, and the first feature maps it outputs can be represented as X = f_backbone(V) ∈ R^(T×H×W×D); wherein f_backbone(·) denotes the feature extractor; x denotes the first feature map of each frame; H, W and D denote the height, width and number of channels of each frame's first feature map, respectively; V = {v_1, v_2, ..., v_T} denotes the video to be recognized; T denotes the total number of frames of the video; Y = {y_1, ..., y_C} denotes the action category labels of the video to be recognized, C is the number of action categories in the dataset, and y_i = 1 indicates that action i is present in the video.
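To make the data flow concrete, the following PyTorch-style sketch shows how the first feature maps might be produced. The X3DBackbone wrapper, its single placeholder convolution, and the output shape (B, T, D, H, W) are illustrative assumptions standing in for an actual X3D-series extractor, not the patent's implementation.

```python
import torch
import torch.nn as nn

class X3DBackbone(nn.Module):
    """Hypothetical stand-in for an X3D-series feature extractor f_backbone(.).

    Any 3D CNN that maps a video clip of shape (B, 3, T, H_in, W_in) to
    per-frame feature maps of shape (B, T, D, H, W) would fit here.
    """
    def __init__(self, out_channels=192):
        super().__init__()
        # A single 3D convolution used purely as a placeholder for X3D.
        self.conv = nn.Conv3d(3, out_channels, kernel_size=3,
                              stride=(1, 16, 16), padding=1)

    def forward(self, video):               # video: (B, 3, T, H_in, W_in)
        feat = self.conv(video)             # (B, D, T, H, W)
        return feat.permute(0, 2, 1, 3, 4)  # (B, T, D, H, W), first feature maps per frame


# Example: one 16-frame clip of 224x224 RGB frames.
video = torch.randn(1, 3, 16, 224, 224)
first_feature_maps = X3DBackbone()(video)  # (1, 16, 192, H, W)
```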
Step S2, in a preset simultaneous action relation modeling module, taking the obtained first feature maps of the consecutive frames as input, modeling the relations between simultaneous actions based on the spatial information of each first feature map, and outputting a plurality of second feature maps carrying the simultaneous action relations;
firstly, predefining a simultaneous action relation modeling module; the simultaneous action relationship modeling module is a transducer decoder, which is composed of a self-attention network and a forward propagation network, as shown in fig. 2.
Secondly, the first feature maps of the consecutive frames are fed into the simultaneous action relation modeling module, the relations between simultaneous actions are modeled based on the spatial information of each first feature map, and a plurality of second feature maps carrying the simultaneous action relations are output. The procedure, illustrated by the sketch after these steps, is as follows:
spatially rearranging the first feature maps of the obtained consecutive frames by X_fsim = f_reshape(X, (T, h×w, d)); wherein f_reshape(·,·) denotes a resizing function that accepts two inputs, the feature map to be resized and the target dimensions;
after the spatial rearrangement of each frame's first feature map, capturing the spatial relations between simultaneous actions with the self-attention network of the Transformer encoder, and then mapping the dimensions back to the input size with the feed-forward network of the Transformer encoder, thereby obtaining a plurality of second feature maps in which the simultaneous action relations have been built;
readjusting the sizes of the plurality of second feature maps with the built simultaneous action relations by X_simu = f_reshape(f_simu(X_fsim), (T, d, h, w)) and outputting them; wherein f_simu(·) denotes a standard Transformer encoder.
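A minimal sketch of the simultaneous action relation modeling module under the reshape-encode-reshape reading above. Treating each frame's h×w spatial positions as tokens for a standard nn.TransformerEncoder (used here as f_simu) is an assumption, as are the layer sizes.

```python
import torch
import torch.nn as nn

class SimultaneousRelationModule(nn.Module):
    """Models relations between co-occurring actions within each frame's spatial map."""
    def __init__(self, dim=192, heads=4, ffn_dim=384, layers=1):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=ffn_dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)  # f_simu(.)

    def forward(self, x):                         # x: (B, T, D, H, W) first feature maps
        b, t, d, h, w = x.shape
        # f_reshape: treat the h*w spatial positions of each frame as a token sequence.
        tokens = x.flatten(3).permute(0, 1, 3, 2).reshape(b * t, h * w, d)
        tokens = self.encoder(tokens)             # self-attention + feed-forward over space
        # f_reshape back to (B, T, D, H, W): second feature maps with simultaneous relations.
        return tokens.reshape(b, t, h, w, d).permute(0, 1, 4, 2, 3)
```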
In one example, FIG. 3 shows a visualization of the captured simultaneous relations. It can be seen in FIG. 3 that simultaneous actions which were not activated before become activated after the simultaneous relations are captured.
Step S3, the obtained plurality of second feature maps carrying the simultaneous action relations are input into a preset sequential action relation modeling module, and the final action result is classified and predicted by fusing the input second feature maps carrying the simultaneous action relations with the third feature maps carrying the sequential action relations obtained in the intermediate process.
Specifically, firstly, a sequential action relation modeling module is predefined; it constructs the sequential relations between actions with a causal self-attention mechanism and captures the order of actions at the temporal level, as shown in FIG. 4.
Secondly, the plurality of second feature maps carrying the simultaneous action relations are fed into the sequential action relation modeling module for feature extraction, fusion and classification, and the multi-label action categories are predicted. The procedure, illustrated by the sketch after these steps, is as follows:
adding a position embedding vector to the resized plurality of second feature maps carrying the simultaneous action relations, so as to retain the position information of each frame; wherein p denotes a learnable position embedding vector;
resizing the plurality of second feature maps with the added position embedding vectors along the temporal dimension by grouping the frames in pairs, so that the sequential relation between the two frames' actions within each group can be captured;
constructing a mask matrix M and realizing a causal self-attention mechanism through f_softmax((f_Q(X) f_K(X)^T)/√d_k + M) f_V(X), so that within each group information can only flow from the second feature map of the earlier frame to the second feature map of the later frame, thereby obtaining a plurality of groups of second feature maps; wherein f_Q(·), f_K(·), f_V(·) denote three linear mapping functions, d_k denotes a scaling factor, and f_softmax(·) denotes the softmax function;
aggregating each group of second feature maps, and repeating causal self-attention and feature aggregation until the temporal dimension is reduced to 1, thereby obtaining a plurality of third feature maps X'_seq carrying the sequential action relations; wherein α denotes a hyperparameter and X̂ denotes an aggregated second feature map;
fusing the plurality of second feature maps carrying the simultaneous action relations with the plurality of third feature maps carrying the sequential action relations through a residual connection, and resizing the result by X_seq = f_reshape(X'_seq + f_mean(X_fseq), (d, h, w)); wherein f_mean(·) denotes the mean function;
classifying the fused feature maps by ŷ = f_fc(f_gap(X_seq)) and predicting the final action result; wherein f_fc(·) denotes a fully connected layer, f_gap(·) denotes global pooling, and ŷ denotes the prediction confidence.
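The following sketch illustrates one possible reading of these steps. The pairwise grouping (assuming the number of frames T is a power of two), the α-weighted aggregation of each pair, and the fusion/classification head are assumptions made to fill in details that are not fully legible in this text; the causal mask follows the stated constraint that information flows only from the earlier frame to the later one.

```python
import math
import torch
import torch.nn as nn

class SequenceRelationModule(nn.Module):
    """Builds sequential (ordered) action relations with a causal self-attention mechanism."""
    def __init__(self, dim=192, num_classes=80, alpha=0.5, max_frames=64):
        super().__init__()
        self.pos = nn.Parameter(torch.zeros(1, max_frames, 1, dim))  # learnable position embedding p
        self.f_q = nn.Linear(dim, dim)          # f_Q
        self.f_k = nn.Linear(dim, dim)          # f_K
        self.f_v = nn.Linear(dim, dim)          # f_V
        self.alpha = alpha                      # aggregation hyperparameter (assumed weighted sum)
        self.fc = nn.Linear(dim, num_classes)   # f_fc, classification head

    def causal_pair_attention(self, x):
        # x: (N, 2, L, D) -- pairs of adjacent frames, each with L spatial tokens.
        n, two, l, d = x.shape
        seq = x.reshape(n, two * l, d)
        q, k, v = self.f_q(seq), self.f_k(seq), self.f_v(seq)
        # Mask so information only flows from the earlier frame to the later one.
        mask = torch.full((two * l, two * l), float('-inf'), device=x.device)
        mask[:l, :l] = 0.0    # earlier frame attends only to itself
        mask[l:, :] = 0.0     # later frame attends to both frames
        attn = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(d) + mask, dim=-1)
        out = (attn @ v).reshape(n, two, l, d)
        # Aggregate the pair into one feature map (assumed alpha-weighted sum).
        return self.alpha * out[:, 0] + (1 - self.alpha) * out[:, 1]   # (N, L, D)

    def forward(self, x_simu):                  # x_simu: (B, T, D, H, W) second feature maps
        b, t, d, h, w = x_simu.shape            # assumes T is a power of two
        tokens = x_simu.flatten(3).permute(0, 1, 3, 2)   # (B, T, H*W, D)
        tokens = tokens + self.pos[:, :t]                 # add position embedding p
        seq = tokens
        while seq.shape[1] > 1:                           # halve the time dimension until it is 1
            bt, tt = seq.shape[0], seq.shape[1]
            pairs = seq.reshape(bt * (tt // 2), 2, h * w, d)
            seq = self.causal_pair_attention(pairs).reshape(bt, tt // 2, h * w, d)
        x_seq = seq.squeeze(1)                            # (B, H*W, D), third feature map
        # Residual fusion with the mean of the second feature maps, then classify.
        fused = x_seq + tokens.mean(dim=1)                # (B, H*W, D)
        logits = self.fc(fused.mean(dim=1))               # global pooling f_gap + f_fc
        return logits                                     # prediction confidence (pre-sigmoid)
```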
It should be noted that the loss function for the prediction confidence ŷ is a multi-label classification loss computed from ŷ and the ground-truth action labels y, in which σ(·) denotes the sigmoid function.
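The exact loss formula is not legible in this text; a standard multi-label binary cross-entropy with a sigmoid, which is consistent with the stated σ(·), would look like the sketch below. This is an assumption, not a verbatim reproduction of the patent's loss.

```python
import torch
import torch.nn.functional as F

def multilabel_bce_loss(pred_confidence, labels):
    """Assumed multi-label loss: sigmoid followed by binary cross-entropy over the C actions.

    pred_confidence: (B, C) raw scores y_hat from the classifier
    labels:          (B, C) binary ground-truth action labels y
    """
    return F.binary_cross_entropy(torch.sigmoid(pred_confidence), labels.float())
```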
In one example, FIG. 5 shows a visualization of the captured sequential relations. It can be seen in FIG. 5 that, after the sequential relations are captured, actions that occur one after another and were not activated before become activated.
The overall logic diagram of the multi-label action recognition method based on simultaneous and sequential action relation modeling according to the embodiment of the present invention is shown in FIG. 6.
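Tying the three stages together, a hypothetical end-to-end forward pass using the sketch classes defined above might look as follows; the clip size, the 192-channel feature dimension and the 80 action classes are arbitrary illustrative choices.

```python
import torch

# Assumed components from the sketches above.
backbone = X3DBackbone(out_channels=192)                 # step S1: first feature maps
simu = SimultaneousRelationModule(dim=192)               # step S2: simultaneous action relations
seq = SequenceRelationModule(dim=192, num_classes=80)    # step S3: sequential relations + classifier

video = torch.randn(1, 3, 16, 224, 224)                  # one 16-frame clip to be recognized
first_maps = backbone(video)                             # (1, 16, 192, H, W)
second_maps = simu(first_maps)                           # second feature maps with simultaneous relations
confidence = torch.sigmoid(seq(second_maps))             # per-action prediction confidence in [0, 1]
```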
The embodiment of the invention has the following beneficial effects:
The invention not only provides a simultaneous action relation modeling module that effectively constructs the simultaneous relations between actions, but also provides a sequential action relation modeling module that remedies the inability of ordinary self-attention to capture order information by constraining information to flow only from earlier frames to later frames, thereby effectively constructing the sequential relations between actions. In this way, the problem in the prior art that only the action co-occurrence relation is considered while the temporal relation is ignored can be solved.
Those of ordinary skill in the art will appreciate that all or a portion of the steps in implementing the methods of the above embodiments may be implemented by a program to instruct related hardware, where the program may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc.
The foregoing disclosure is illustrative of the present invention and is not to be construed as limiting the scope of the invention, which is defined by the appended claims.

Claims (5)

1. A multi-label action recognition method based on simultaneous and sequential action relation modeling, the method comprising the following steps:
acquiring a video to be recognized, and performing feature extraction with a preset feature extraction network to obtain first feature maps of consecutive frames;
in a preset simultaneous action relation modeling module, taking the obtained first feature maps of the consecutive frames as input, modeling the relations between simultaneous actions based on the spatial information of each first feature map, and outputting a plurality of second feature maps carrying the simultaneous action relations;
and in a preset sequential action relation modeling module, taking the plurality of second feature maps carrying the simultaneous action relations as input, and classifying and predicting the final action result by fusing the plurality of second feature maps carrying the simultaneous action relations with a plurality of third feature maps carrying the sequential action relations obtained in the intermediate process.
2. The multi-label action recognition method based on simultaneous and sequential action relation modeling of claim 1, wherein the feature extraction network is an X3D-series feature extractor, and the first feature maps it outputs are represented as X = f_backbone(V) ∈ R^(T×H×W×D); wherein f_backbone(·) denotes the feature extractor; x denotes the first feature map of each frame; H, W and D denote the height, width and number of channels of each frame's first feature map, respectively; V = {v_1, v_2, ..., v_T} denotes the video to be recognized; T denotes the total number of frames of the video; Y = {y_1, ..., y_C} denotes the action category labels of the video to be recognized, C is the number of action categories in the dataset, and y_i = 1 indicates that action i is present in the video.
3. The multi-label action recognition method based on simultaneous and sequential action relation modeling of claim 2, wherein the simultaneous action relation modeling module is a Transformer encoder composed of a self-attention network and a feed-forward network; wherein
the simultaneous action relation modeling module obtains the plurality of second feature maps carrying the simultaneous action relations by executing the following steps:
spatially rearranging the first feature maps of the obtained consecutive frames by X_fsim = f_reshape(X, (T, h×w, d)); wherein f_reshape(·,·) denotes a resizing function that accepts two inputs, the feature map to be resized and the target dimensions;
after the spatial rearrangement of each frame's first feature map, capturing the spatial relations between simultaneous actions with the self-attention network of the Transformer encoder, and then mapping the dimensions back to the input size with the feed-forward network of the Transformer encoder, thereby obtaining a plurality of second feature maps in which the simultaneous action relations have been built;
readjusting the sizes of the plurality of second feature maps with the built simultaneous action relations by X_simu = f_reshape(f_simu(X_fsim), (T, d, h, w)) and outputting them; wherein f_simu(·) denotes a standard Transformer encoder.
4. The multi-label action recognition method based on simultaneous and sequential action relation modeling of claim 3, wherein the sequential action relation modeling module constructs the sequential relations between actions with a causal self-attention mechanism; wherein
the sequential action relation modeling module predicts the final action result by executing the following steps:
adding a position embedding vector to the resized plurality of second feature maps carrying the simultaneous action relations, so as to retain the position information of each frame; wherein p denotes a learnable position embedding vector;
resizing the plurality of second feature maps with the added position embedding vectors along the temporal dimension by grouping the frames in pairs, so that the sequential relation between the two frames' actions within each group can be captured;
constructing a mask matrix M and realizing a causal self-attention mechanism through f_softmax((f_Q(X) f_K(X)^T)/√d_k + M) f_V(X), so that within each group information can only flow from the second feature map of the earlier frame to the second feature map of the later frame, thereby obtaining a plurality of groups of second feature maps; wherein f_Q(·), f_K(·), f_V(·) denote three linear mapping functions, d_k denotes a scaling factor, and f_softmax(·) denotes the softmax function;
aggregating each group of second feature maps, and repeating causal self-attention and feature aggregation until the temporal dimension is reduced to 1, thereby obtaining a plurality of third feature maps X'_seq carrying the sequential action relations; wherein α denotes a hyperparameter and X̂ denotes an aggregated second feature map;
fusing the plurality of second feature maps carrying the simultaneous action relations with the plurality of third feature maps carrying the sequential action relations through a residual connection, and resizing the result by X_seq = f_reshape(X'_seq + f_mean(X_fseq), (d, h, w)); wherein f_mean(·) denotes the mean function;
classifying the fused feature maps by ŷ = f_fc(f_gap(X_seq)) and predicting the final action result; wherein f_fc(·) denotes a fully connected layer, f_gap(·) denotes global pooling, and ŷ denotes the prediction confidence.
5. The multi-label action recognition method based on simultaneous and sequential action relation modeling of claim 4, wherein the loss function for the prediction confidence ŷ is a multi-label classification loss computed from ŷ and the ground-truth action labels y, in which σ(·) denotes the sigmoid function.
CN202311407105.4A 2023-10-26 2023-10-26 Multi-label action recognition method based on simultaneous and sequential action relation modeling Pending CN117409479A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311407105.4A CN117409479A (en) 2023-10-26 2023-10-26 Multi-label action recognition method based on simultaneous and sequential action relation modeling

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311407105.4A CN117409479A (en) 2023-10-26 2023-10-26 Multi-label action recognition method based on simultaneous and sequential action relation modeling

Publications (1)

Publication Number Publication Date
CN117409479A true CN117409479A (en) 2024-01-16

Family

ID=89492222

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311407105.4A Pending CN117409479A (en) 2023-10-26 2023-10-26 Multi-label action recognition method based on simultaneous and sequential action relation modeling

Country Status (1)

Country Link
CN (1) CN117409479A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination