CN114333057A - Combined action recognition method and system based on multi-level feature interactive fusion - Google Patents
- Publication number
- CN114333057A (application number CN202111639981.0A)
- Authority
- CN
- China
- Prior art keywords
- semantic
- feature
- features
- instance
- interaction
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Image Analysis (AREA)
Abstract
The invention discloses a combined action recognition method and system based on multi-level feature interactive fusion. The method comprises the following steps: position-to-appearance feature extraction, which extracts instance-centered joint features from low-level appearance information by using position information; semantic feature interaction, which further explores the semantic interaction between the joint features and the instance identities; semantic-to-position prediction, which maps the semantic features back to a low-dimensional position space to predict instance positions; and action category prediction, which aggregates the instance-centered features to recognize the action. The method can effectively fuse features from different sources and improve the accuracy of combined action recognition.
Description
Technical Field
The invention relates to the technical field of action recognition in computer vision, and in particular to a combined action recognition method and system based on multi-level feature interactive fusion.
Background
Human action recognition aims at understanding human actions from a given video sequence. In recent years, many deep learning methods have been applied in this field, typically relying on a strong backbone network to extract video features for recognition. Due to the complexity of actions and the limited number of sampled frames, the features extracted by such methods often carry a pronounced appearance bias, i.e., they associate an action with a particular object or scene. This is effective for recognizing simple actions that depend on scene information, but is insufficient for actions with a complex temporal structure.
When recognizing actions, humans attend not only to scene information but also to how the distances between objects in the scene change over time. This ability allows a human to easily recognize unseen combinations of actions and objects, and this idea of combined action recognition is introduced into the model design. Such combined reasoning requires not only picture features that represent the scene, but also information from other inputs such as object positions. Since features at different levels differ greatly in modality and dimension, how to effectively fuse multi-level features is the main problem of combined action recognition.
Disclosure of Invention
The invention aims to provide a combined action recognition method and system based on multi-level feature interactive fusion, which fuse information from different sources in an interactive manner and improve the accuracy of combined action recognition.
The technical solution for realizing the purpose of the invention is as follows: a combined action recognition method based on multi-level feature interactive fusion comprises the following steps:
Step 1: position-to-appearance feature extraction, which extracts instance-centered joint features from low-level appearance information by using position information;
Step 2: semantic feature interaction, which further explores the semantic interaction between the joint features and the instance identities;
Step 3: semantic-to-position prediction, which maps the semantic features back to a low-dimensional position space to predict instance positions;
Step 4: action category prediction, which aggregates the instance-centered features to recognize the action.
A combined action recognition system based on multi-level feature interactive fusion comprises a feature extraction module, a semantic feature interaction module, a semantic-to-position prediction module and an action category prediction module, wherein:
the feature extraction module performs position-to-appearance feature extraction, extracting instance-centered joint features from low-level appearance information based on instance position information;
the semantic feature interaction module performs semantic feature interaction to obtain the semantic interaction between the joint features and the instance identities;
the semantic-to-position prediction module performs semantic-to-position prediction, remapping the semantic features to the position space of the original dimension for instance position prediction;
the action category prediction module performs action category prediction, aggregating the instance-centered features to carry out combined action recognition.
Compared with the prior art, the invention has the following beneficial effects:
(1) a simple and unified combined action recognition system is provided, in which different information sources (including appearance, position and semantic features) undergo multi-stage interactive fusion, promoting the fusion of multi-modal information;
(2) through an auxiliary position prediction task, the model is forced to pay more attention to latent motion cues, promoting its modeling of temporal information;
(3) the method explicitly models non-appearance features, improving the model's generalization to the appearance of scenes and objects.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Fig. 2 is a system framework diagram of the present invention.
Detailed Description
The invention is described in further detail below with reference to the accompanying drawings:
With reference to Fig. 1 and Fig. 2, the combined action recognition method based on multi-level feature interactive fusion includes four processes: position-to-appearance feature extraction, semantic feature interaction, semantic-to-position prediction, and action category prediction.
The position-to-appearance feature extraction includes the steps of:
Step 1) Sample T frames from the video sequence and extract a spatio-temporal appearance representation with the I3D backbone network, obtaining a feature map of dimension $T \times H \times W \times d_{fea}$, where H and W are the width and height of the sampled pictures and $d_{fea}$ is the number of channels;
Step 2) According to the coordinate trajectories, crop and scale the obtained feature map with RoIAlign to obtain the appearance feature $F_{app}$ of each instance;
Step 3) Map the instance coordinates and embed the instance identities into high-dimensional vectors respectively, concatenate them, and pass them through a multi-layer perceptron to obtain the non-appearance feature $F_{non\_app}$ of the object. The instance identity refers to the code C indicating whether the instance is a person or an object;
Step 4) Concatenate $F_{app}$ and $F_{non\_app}$ to obtain an instance-centered joint representation F of the video.
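Steps 1) to 4) can be illustrated with a minimal numpy sketch. This is not the patent's implementation: `crop_pool` is a crude stand-in for RoIAlign, and all dimensions, weights, and helper names are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def crop_pool(feat_map, box):
    """Crude stand-in for RoIAlign: crop the box region from an
    (H, W, C) feature map and average-pool it to a single vector."""
    H, W, _ = feat_map.shape
    x1, y1, x2, y2 = box
    r0, r1 = int(y1 * H), max(int(y2 * H), int(y1 * H) + 1)
    c0, c1 = int(x1 * W), max(int(x2 * W), int(x1 * W) + 1)
    return feat_map[r0:r1, c0:c1].mean(axis=(0, 1))

def mlp(x, w1, w2):
    # Two-layer perceptron with ReLU, standing in for the patent's MLP.
    return np.maximum(x @ w1, 0.0) @ w2

C, d_emb = 8, 16                         # toy channel / embedding sizes (hypothetical)
feat_map = rng.normal(size=(7, 7, C))    # one frame of the backbone feature map
box = (0.1, 0.2, 0.5, 0.6)               # normalized (x1, y1, x2, y2) instance box
identity = np.array([1.0, 0.0])          # one-hot identity code C: person vs. object

F_app = crop_pool(feat_map, box)         # appearance feature F_app

# Non-appearance feature F_non_app: concatenate coords + identity, pass through an MLP.
w1 = rng.normal(size=(6, d_emb))
w2 = rng.normal(size=(d_emb, d_emb))
F_non_app = mlp(np.concatenate([np.asarray(box), identity]), w1, w2)

# Step 4): concatenate into the instance-centered joint representation F.
F = np.concatenate([F_app, F_non_app])
print(F.shape)  # (24,)
```

Per instance and per frame, the joint feature is simply the concatenation of a pooled appearance vector and an embedded position/identity vector, which is what later stages consume.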
The semantic feature interaction comprises the following steps:
Step 5) Spatial information transfer between instance nodes. Given a time step t, information is passed between every pair of nodes in space through an interaction function. According to the instance identities, three interaction functions are constructed, $f_{ss}$, $f_{so}$ and $f_{oo}$, corresponding to the person-person pair set $\varepsilon_{ss}$, the person-object pair set $\varepsilon_{so}$, and the object-object pair set $\varepsilon_{oo}$. The spatial interaction function is:

$$G_i^t = \sum_{(i,j)\in\varepsilon_{ss}} f_{ss}([F_i^t, F_j^t]) + \sum_{(i,j)\in\varepsilon_{so}} f_{so}([F_i^t, F_j^t]) + \sum_{(i,j)\in\varepsilon_{oo}} f_{oo}([F_i^t, F_j^t])$$

where $F_i^t$ and $G_i^t$ respectively denote the basic joint feature and the spatial feature of the i-th instance at time t, $[\cdot,\cdot]$ denotes the concatenation operation, and the interaction functions may be implemented as MLPs;
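The identity-conditioned spatial message passing of step 5) can be sketched in numpy as follows. The dimensions, the single-layer ReLU "MLPs", and the helper names (`make_mlp`, `spatial_interaction`) are illustrative assumptions, not from the patent.

```python
import numpy as np

rng = np.random.default_rng(1)

d = 8
ids = ['s', 'o', 'o']                     # instance identities: person 's', object 'o'
F_t = rng.normal(size=(len(ids), d))      # basic joint features F_i^t at one time step

def make_mlp(d_in, d_out):
    w = rng.normal(size=(d_in, d_out)) * 0.1
    return lambda x: np.maximum(x @ w, 0.0)

# One interaction function per identity pair: f_ss, f_so, f_oo.
f_pair = {k: make_mlp(2 * d, d) for k in ('ss', 'so', 'oo')}
canon = {'ss': 'ss', 'so': 'so', 'os': 'so', 'oo': 'oo'}

def spatial_interaction(F_t, ids):
    """G_i^t: sum of identity-conditioned messages f([F_i^t, F_j^t]) over pairs."""
    N = len(ids)
    G_t = np.zeros_like(F_t)
    for i in range(N):
        for j in range(N):
            if i != j:
                f = f_pair[canon[ids[i] + ids[j]]]
                G_t[i] += f(np.concatenate([F_t[i], F_t[j]]))
    return G_t

G_t = spatial_interaction(F_t, ids)
print(G_t.shape)  # (3, 8)
```

Each node's spatial feature aggregates one message per neighbor, with the message function selected by whether the pair is person-person, person-object, or object-object.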
Step 6) Temporal information transfer between instance nodes. Given the spatial features $G_i = \{G_i^1, \ldots, G_i^T\}$ of the i-th instance, temporal dependencies are further captured by an RNN and the instance features are fused along the temporal dimension:

$$Z_i = \psi_T\big(\mathrm{Cat}(\mathrm{RNN}(G_i; \theta_T))\big)$$

where $\mathrm{Cat}(\cdot)$ concatenates all the RNN outputs along the last dimension, $\psi_T$ is an MLP that encodes each concatenated instance feature, and $\theta_T$ denotes the learnable parameters of the RNN. For each video, the spatio-temporal features of the N instances, $Z = \{Z_1, Z_2, \ldots, Z_N\}$, are used for the final action classification.
the semantic to position prediction mainly comprises the following steps:
Step 7) Predict the future state of each instance from the observed features. Given the $T_{obs}$ most recent observed features of the i-th instance, the (t+1)-th state is predicted as:

$$\hat{Z}_i^{t+1} = \mathrm{RNN}\big(Z_i^{t-T_{obs}+1}, \ldots, Z_i^{t}; \theta_T\big)$$

where $T_{obs}$ is the number of frames observed by the predictor in each step. The RNN models the temporal structure hidden in the data, and its final output is treated as the predicted state $\hat{Z}_i^{t+1}$. This RNN shares parameters with the RNN in step 6);
Step 8) Given the predicted state $\hat{Z}_i^{t+1}$ of the i-th instance at time t+1, its absolute position and relative position (the difference between the instance centers in two consecutive frames) are estimated by two linear layers:

$$\hat{p}_i^{t+1} = \mathrm{Linear}_p(\hat{Z}_i^{t+1}), \qquad \hat{\delta}_i^{t+1} = \mathrm{Linear}_\delta(\hat{Z}_i^{t+1})$$
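Steps 7) and 8) together can be sketched as follows: an RNN consumes the last T_obs observed features and its final hidden state is mapped to position estimates by two linear heads. The 2-D centers, the vanilla tanh RNN, and all weight shapes are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

d, T, T_obs = 8, 6, 3
Z_obs = rng.normal(size=(T, d))      # observed per-frame features of one instance

W_x = rng.normal(size=(d, d)) * 0.1
W_h = rng.normal(size=(d, d)) * 0.1

def rnn_last(seq):
    """Run a vanilla tanh RNN over seq and return the final hidden state,
    treated as the predicted state for the next frame."""
    h = np.zeros(d)
    for x in seq:
        h = np.tanh(x @ W_x + h @ W_h)
    return h

z_hat = rnn_last(Z_obs[-T_obs:])     # predicted state from the last T_obs frames

# Two linear heads estimate the absolute center and the center offset
# between consecutive frames (2-D positions assumed for illustration).
W_p = rng.normal(size=(d, 2))
W_r = rng.normal(size=(d, 2))
p_abs, p_rel = z_hat @ W_p, z_hat @ W_r
print(p_abs.shape, p_rel.shape)  # (2,) (2,)
```

Because this auxiliary head must recover positions from the semantic features, it pressures the model to keep motion information in those features rather than relying on appearance alone.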
The action category prediction mainly comprises the following steps:
Step 9) Pool the spatio-temporal features Z of the N instances obtained in step 6) as the video-level representation, and obtain the class probabilities through softmax; the class with the highest probability is the class to which the video belongs.
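Step 9) amounts to pooling instance features and classifying, which can be sketched as below. Mean pooling and a single linear classifier are assumptions; the patent only specifies pooling followed by softmax.

```python
import numpy as np

rng = np.random.default_rng(4)

N, d, n_classes = 4, 8, 5
Z = rng.normal(size=(N, d))              # spatio-temporal features of the N instances

video_feat = Z.mean(axis=0)              # pool instance features into one video-level vector

W_cls = rng.normal(size=(d, n_classes))  # classifier head (hypothetical linear layer)
logits = video_feat @ W_cls

probs = np.exp(logits - logits.max())    # numerically stable softmax
probs /= probs.sum()
pred = int(np.argmax(probs))             # class with the highest probability
print(probs.shape, pred in range(n_classes))  # (5,) True
```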
Claims (8)
1. A combined action recognition method based on multi-level feature interactive fusion, characterized by comprising the following steps:
performing position-to-appearance feature extraction, extracting instance-centered joint features from low-level appearance information based on instance position information;
performing semantic feature interaction to obtain the semantic interaction between the joint features and the instance identities;
performing semantic-to-position prediction, remapping the semantic features to the position space of the original dimension for instance position prediction;
performing action category prediction, aggregating the instance-centered features to carry out combined action recognition.
2. The combined action recognition method based on multi-level feature interactive fusion according to claim 1, wherein extracting instance-centered joint features from low-level appearance information based on instance position information specifically comprises:
sampling T frames from the video sequence and extracting their spatio-temporal representation with a deep convolutional network, obtaining a feature map of dimension $T \times H \times W \times d_{fea}$, where H and W are respectively the width and height of the sampled pictures and $d_{fea}$ is the number of channels;
according to the coordinate trajectories, cropping and scaling the obtained feature map with RoIAlign to obtain the appearance feature $F_{app}$ of each instance;
mapping the instance position coordinates and embedding the instance identities into high-dimensional vectors respectively, concatenating them, and passing them through a multi-layer perceptron to obtain the non-appearance feature $F_{non\_app}$ of the object, the instance identity referring to the code C indicating whether the instance is a person or an object;
concatenating the appearance feature $F_{app}$ and the non-appearance feature $F_{non\_app}$ to obtain an instance-centered joint feature F.
3. The method according to claim 2, wherein the high-dimensional vectors are 512-dimensional.
4. The combined action recognition method based on multi-level feature interactive fusion according to claim 1, wherein obtaining the semantic interaction between the joint features and the instance identities specifically comprises:
performing spatial information transfer between instance nodes: given a time step t, information is passed between every pair of nodes in space through an interaction function; according to the instance identities, three interaction functions are constructed, $f_{ss}$, $f_{so}$ and $f_{oo}$, corresponding to the person-person pair set $\varepsilon_{ss}$, the person-object pair set $\varepsilon_{so}$, and the object-object pair set $\varepsilon_{oo}$; the spatial interaction function is:

$$G_i^t = \sum_{(i,j)\in\varepsilon_{ss}} f_{ss}([F_i^t, F_j^t]) + \sum_{(i,j)\in\varepsilon_{so}} f_{so}([F_i^t, F_j^t]) + \sum_{(i,j)\in\varepsilon_{oo}} f_{oo}([F_i^t, F_j^t])$$

where $F_i^t$ and $G_i^t$ respectively denote the joint feature and the spatial feature of the i-th instance at time t, and $[\cdot,\cdot]$ denotes the concatenation operation;
performing temporal information transfer between instance nodes: given the spatial features $G_i$ of the i-th instance, temporal dependencies are further captured by an RNN and the instance features are fused along the temporal dimension:

$$Z_i = \psi_T\big(\mathrm{Cat}(\mathrm{RNN}(G_i; \theta_T))\big)$$

where $\mathrm{Cat}(\cdot)$ concatenates all the RNN outputs along the last dimension, $\psi_T(\cdot)$ is an MLP that encodes each concatenated instance feature, and $\theta_T$ represents the learnable parameters of the RNN; for each video, the spatio-temporal features of the N instances, $Z = \{Z_1, Z_2, \ldots, Z_N\}$, are used for the final action classification.
6. The combined action recognition method based on multi-level feature interactive fusion according to claim 1, wherein remapping the semantic features to the position space of the original dimension for instance position prediction specifically comprises the following steps:
predicting the future state of each instance from the observed features:

$$\hat{Z}_i^{t+1} = \mathrm{RNN}\big(Z_i^{t-T_{obs}+1}, \ldots, Z_i^{t}; \theta_T\big)$$

where $T_{obs}$ represents the number of frames observed by the predictor in each step; the RNN models the temporal structure hidden in the data, and its final output is treated as the predicted state $\hat{Z}_i^{t+1}$;
given the predicted state $\hat{Z}_i^{t+1}$ of the i-th instance at time t+1, estimating its absolute position $\hat{p}_i^{t+1}$ and relative position $\hat{\delta}_i^{t+1}$ by two linear layers:

$$\hat{p}_i^{t+1} = \mathrm{Linear}_p(\hat{Z}_i^{t+1}), \qquad \hat{\delta}_i^{t+1} = \mathrm{Linear}_\delta(\hat{Z}_i^{t+1})$$
7. The combined action recognition method based on multi-level feature interactive fusion according to claim 1, wherein aggregating the instance-centered features to carry out combined action recognition specifically comprises:
pooling the spatio-temporal features Z of the N instances as the video-level representation, and obtaining the class probabilities through softmax, the class with the highest probability being the class to which the video belongs.
8. A combined action recognition system based on multi-level feature interactive fusion, characterized by comprising a feature extraction module, a semantic feature interaction module, a semantic-to-position prediction module and an action category prediction module, wherein:
the feature extraction module performs position-to-appearance feature extraction, extracting instance-centered joint features from low-level appearance information based on instance position information;
the semantic feature interaction module performs semantic feature interaction to obtain the semantic interaction between the joint features and the instance identities;
the semantic-to-position prediction module performs semantic-to-position prediction, remapping the semantic features to the position space of the original dimension for instance position prediction;
the action category prediction module performs action category prediction, aggregating the instance-centered features to carry out combined action recognition.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111639981.0A CN114333057A (en) | 2021-12-29 | 2021-12-29 | Combined action recognition method and system based on multi-level feature interactive fusion |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114333057A true CN114333057A (en) | 2022-04-12 |
Family
ID=81017659
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114677633A (en) * | 2022-05-26 | 2022-06-28 | 之江实验室 | Multi-component feature fusion-based pedestrian detection multi-target tracking system and method |
Legal Events

Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |