CN114333057A - Combined action recognition method and system based on multi-level feature interactive fusion - Google Patents

Combined action recognition method and system based on multi-level feature interactive fusion Download PDF

Info

Publication number
CN114333057A
Authority
CN
China
Prior art keywords
semantic
feature
features
instance
interaction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111639981.0A
Other languages
Chinese (zh)
Inventor
Xiangbo Shu (舒祥波)
Yafei Cui (崔亚飞)
Jinhui Tang (唐金辉)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Science and Technology
Original Assignee
Nanjing University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Science and Technology filed Critical Nanjing University of Science and Technology
Priority to CN202111639981.0A priority Critical patent/CN114333057A/en
Publication of CN114333057A publication Critical patent/CN114333057A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a combined action recognition method and system based on multi-level feature interactive fusion, wherein the method comprises the following steps: position-to-appearance feature extraction, which uses position information to extract instance-centric joint features from low-level appearance information; semantic feature interaction, which further explores the semantic interaction between the joint features and the instance identities; semantic-to-position prediction, which maps the semantic features back to the low-dimensional position space to predict instance positions; and action category prediction, which aggregates the instance-centric features to recognize the action. The method can effectively fuse features from different sources and improve the accuracy of combined action recognition.

Description

Combined action recognition method and system based on multi-level feature interactive fusion
Technical Field
The invention relates to the technical field of action recognition in computer vision, and in particular to a combined action recognition method and system based on multi-level feature interactive fusion.
Background
Human action recognition aims at understanding human actions from a given video sequence. In recent years, many deep learning methods have been applied in this field, and researchers tend to rely on a strong backbone network to extract video features for recognition. Due to the complexity of actions and the limited number of sampled frames, the features extracted by such methods often carry an appearance generalization bias, i.e., they associate an action with a particular object or scene. This works well for recognizing simple actions that depend on scene information, but is insufficient for actions with a complex temporal structure.
When recognizing actions, humans attend not only to scene information but also to how the distances between objects in the scene change over time. This ability allows a human to easily recognize an unseen combination of an action and objects. The invention introduces this idea of combined (compositional) action recognition into the model design. Such compositional reasoning requires not only image features that represent the scene, but also information from other inputs such as object positions. Since features from different levels differ greatly in modality and dimension, how to effectively fuse multi-level features is the main problem of combined action recognition.
Disclosure of Invention
The invention aims to provide a combined action recognition method and a combined action recognition system based on multi-level feature interactive fusion, which can fuse information from different sources in an interactive mode and improve the accuracy of combined action recognition.
The technical solution for realizing the purpose of the invention is as follows: a combined action recognition method based on multi-level feature interactive fusion comprises the following steps:
Step 1: position-to-appearance feature extraction, which uses position information to extract instance-centric joint features from low-level appearance information;
Step 2: semantic feature interaction, which further explores the semantic interaction between the joint features and the instance identities;
Step 3: semantic-to-position prediction, which maps the semantic features back to the low-dimensional position space to predict instance positions;
Step 4: action category prediction, which aggregates the instance-centric features to recognize the action.
A combined action recognition system based on multi-level feature interactive fusion comprises a feature extraction module, a semantic feature interaction module, a semantic-to-position prediction module, and an action category prediction module, wherein:
the feature extraction module performs position-to-appearance feature extraction, extracting instance-centric joint features from low-level appearance information based on instance position information;
the semantic feature interaction module performs semantic feature interaction to obtain the semantic interaction between the joint features and the instance identities;
the semantic-to-position prediction module performs semantic-to-position prediction, remapping the semantic features to the position space of the original dimension for instance position prediction;
the action category prediction module performs action category prediction, aggregating the instance-centric features for combined action recognition (a minimal composition sketch is given after this description).
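As an illustration of how the four modules fit together, the following minimal PyTorch-style sketch composes them into one model; the class name MultiLevelFusionRecognizer, the submodule interfaces, and the returned values are assumptions rather than the patented implementation (per-module sketches follow in the detailed description).

```python
import torch.nn as nn

class MultiLevelFusionRecognizer(nn.Module):
    """Composes the four modules of the recognition system; names and interfaces are assumptions."""
    def __init__(self, feature_extractor, semantic_interaction, position_predictor, action_classifier):
        super().__init__()
        self.feature_extractor = feature_extractor        # position-to-appearance feature extraction
        self.semantic_interaction = semantic_interaction  # semantic feature interaction
        self.position_predictor = position_predictor      # semantic-to-position prediction (auxiliary task)
        self.action_classifier = action_classifier        # action category prediction

    def forward(self, feat_map, boxes, identities):
        F = self.feature_extractor(feat_map, boxes, identities)  # instance-centric joint features
        Z, H = self.semantic_interaction(F, identities)          # video-level and per-step semantic features
        positions = self.position_predictor(H)                   # auxiliary instance position prediction
        action = self.action_classifier(Z)                       # action category prediction
        return action, positions
```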
Compared with the prior art, the invention has the following beneficial effects:
(1) a simple and unified combined action recognition system is provided, in which different information sources (appearance, position, and semantic features) are fused through multi-stage interaction, promoting the fusion of multi-modal information;
(2) through an auxiliary position prediction task, the model is forced to pay more attention to latent motion cues, which improves its modeling of temporal information;
(3) the method explicitly models non-appearance features, improving the model's generalization to the appearance of scenes and objects.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Fig. 2 is a system framework diagram of the present invention.
Detailed Description
The invention is described in further detail below with reference to the accompanying drawings:
With reference to fig. 1 and fig. 2, a combined action recognition method based on multi-level feature interactive fusion includes four stages: position-to-appearance feature extraction, semantic feature interaction, semantic-to-position prediction, and action category prediction.
The position-to-appearance feature extraction includes the steps of:
Step 1) Sample T frames from the video sequence and extract a spatio-temporal appearance representation with the backbone network I3D, obtaining a feature map whose number of channels is d_fea;
Step 2) According to the coordinate trajectories, crop and scale the obtained feature map with RoIAlign to obtain the appearance feature F_app of each instance;
Step 3) Map the instance coordinates and embed the instance identities into high-dimensional vectors respectively, concatenate the two vectors, and pass them through a multilayer perceptron to obtain the non-appearance feature F_non_app of the object; the instance identity is a code C indicating whether the instance is a person or an object;
Step 4) Concatenate F_app and F_non_app to obtain the instance-centric joint representation F of the video (a minimal sketch of this stage is given below).
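The following is a minimal sketch of this position-to-appearance stage, assuming a PyTorch implementation in which the I3D feature map has already been computed; the crop size, the 512-dimensional embedding width (consistent with the 512 dimensions mentioned in claim 3), the identity coding, and all class and parameter names are illustrative assumptions rather than the patented implementation.

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class InstanceFeatureExtractor(nn.Module):
    def __init__(self, d_fea=1024, d_embed=512, num_identities=2, crop=3):
        super().__init__()
        self.crop = crop
        self.coord_proj = nn.Linear(4, d_embed)                 # map box coordinates to a high-dimensional vector
        self.id_embed = nn.Embedding(num_identities, d_embed)   # embed the identity code C (person / object)
        self.non_app_mlp = nn.Sequential(                       # multilayer perceptron for the non-appearance feature
            nn.Linear(2 * d_embed, d_embed), nn.ReLU(), nn.Linear(d_embed, d_embed))
        self.app_proj = nn.Linear(d_fea * crop * crop, d_embed)  # flatten the RoIAlign crop to F_app

    def forward(self, feat_map, boxes, identities):
        # feat_map: (T, d_fea, H, W) backbone features; boxes: (T, N, 4) float boxes in feature-map
        # coordinates; identities: (N,) integer identity codes.
        T, N = boxes.shape[:2]
        frame_idx = torch.arange(T, dtype=boxes.dtype, device=boxes.device).repeat_interleave(N)
        rois = torch.cat([frame_idx[:, None], boxes.reshape(T * N, 4)], dim=1)  # (T*N, 5) with frame index
        crops = roi_align(feat_map, rois, output_size=(self.crop, self.crop))   # crop and scale per instance
        F_app = self.app_proj(crops.flatten(1)).view(T, N, -1)                  # appearance feature F_app
        coord = self.coord_proj(boxes)                                          # (T, N, d_embed)
        ident = self.id_embed(identities)[None].expand(T, -1, -1)               # (T, N, d_embed)
        F_non_app = self.non_app_mlp(torch.cat([coord, ident], dim=-1))         # non-appearance feature F_non_app
        return torch.cat([F_app, F_non_app], dim=-1)                            # instance-centric joint feature F
```

RoIAlign is taken from torchvision, and the box coordinates are assumed to already be expressed in feature-map coordinates, so no spatial_scale is applied.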
The semantic feature interaction comprises the following steps:
Step 5) Perform spatial information transfer among the instance nodes. Given a time step t, information is transferred between every pair of nodes in space through an interaction function. According to the instance identities, three interaction functions ε_ss, ε_so and ε_oo are constructed, corresponding to person-person, person-object and object-object instance pairs respectively. The spatial feature G_i^t of the i-th instance at time t is obtained by applying the interaction function matching each pair's identities to the concatenation of the pair's basic joint features and aggregating the results over the paired instances, where F̂_i^t and G_i^t denote the basic joint feature and the spatial feature of the i-th instance at time t, [,] denotes the concatenation operation, and MLPs may be employed to obtain the basic joint features F̂_i^t;
Step 6) Perform temporal information transfer among the instance nodes. Given the spatial features G_i of the i-th instance, temporal dependencies are further captured by an RNN and the instance features are fused along the temporal dimension, i.e. Z_i = ψ_T(Cat(RNN(G_i; θ_T))), where Cat(·) concatenates all the RNN outputs along the last dimension, ψ_T is an MLP that encodes each concatenated instance feature, and θ_T denotes the learnable parameters of the RNN. For each video, the spatio-temporal features of the N instances, Z = {Z_1, Z_2, ..., Z_N}, are used for the final action classification (a sketch of this semantic interaction stage is given below);
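Below is a sketch of this semantic feature interaction stage under explicit assumptions: the interaction functions ε_ss, ε_so, ε_oo are taken to be small MLPs, the pairwise messages are summed (the exact aggregation formula appears only as an image in the original), the RNN is a GRU, and the identity coding (0 = person, 1 = object) as well as all class and parameter names are illustrative.

```python
import torch
import torch.nn as nn

class SemanticInteraction(nn.Module):
    def __init__(self, d=1024, num_frames=8):
        super().__init__()
        def mlp(d_in, d_out):
            return nn.Sequential(nn.Linear(d_in, d_out), nn.ReLU(), nn.Linear(d_out, d_out))
        # identity-typed interaction functions eps_ss / eps_so / eps_oo (assumed to be MLPs)
        self.eps_ss, self.eps_so, self.eps_oo = mlp(2 * d, d), mlp(2 * d, d), mlp(2 * d, d)
        self.rnn = nn.GRU(d, d, batch_first=True)   # temporal information transfer (the patent says only "RNN")
        self.psi_T = mlp(num_frames * d, d)         # MLP psi_T over the RNN outputs concatenated along time

    def pairwise(self, fi, fj, id_i, id_j):
        pair = torch.cat([fi, fj], dim=-1)          # [,] concatenation of the two joint features
        if id_i == 0 and id_j == 0:
            return self.eps_ss(pair)                # person-person
        if id_i == 1 and id_j == 1:
            return self.eps_oo(pair)                # object-object
        return self.eps_so(pair)                    # person-object (either order)

    def forward(self, F, identities):
        # F: (T, N, d) basic joint features with T == num_frames; identities: (N,) with 0 = person, 1 = object
        T, N, _ = F.shape
        frames = []
        for t in range(T):                          # spatial information transfer within each frame
            rows = []
            for i in range(N):
                msgs = [self.pairwise(F[t, i], F[t, j], int(identities[i]), int(identities[j]))
                        for j in range(N) if j != i]
                rows.append(torch.stack(msgs).sum(dim=0) if msgs else F[t, i])
            frames.append(torch.stack(rows))
        G = torch.stack(frames)                     # (T, N, d) spatial features G_i^t
        H, _ = self.rnn(G.permute(1, 0, 2))         # (N, T, d) per-instance RNN outputs over time
        Z = self.psi_T(H.reshape(N, -1))            # Cat along the last dimension, then psi_T -> (N, d)
        return Z, H                                 # Z: spatio-temporal features used for classification
```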
the semantic to position prediction mainly comprises the following steps:
Step 7) Predict the future state of each instance from the observed features. Given the features of the i-th instance observed up to time t, its state at time t+1 is predicted by an RNN from the T_obs most recently observed features, where T_obs is the number of frames observed by the predictor at each step. The RNN is used to model the temporal structure hidden in the data, and its final output is taken as the predicted state of the instance at time t+1. The RNN here shares parameters with the RNN in step 6);
Step 8) Given the predicted state of the i-th instance at time t+1, its absolute position and its relative position (the displacement of the instance's center between two consecutive frames) are estimated by two linear layers; when t < T_obs, the corresponding values are padded with ground-truth values (a sketch of this prediction stage is given below).
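A possible sketch of this semantic-to-position stage follows, assuming that the per-step instance features from the interaction stage are fed to a GRU (the patent states only that an RNN is used and that it shares parameters with the RNN of step 6), and that the two heads output a 4-d box and a 2-d center displacement; the window length t_obs and all names are assumptions.

```python
import torch
import torch.nn as nn

class PositionPredictor(nn.Module):
    def __init__(self, d=1024, t_obs=4):
        super().__init__()
        self.t_obs = t_obs
        self.rnn = nn.GRU(d, d, batch_first=True)   # in the method this RNN shares parameters with step 6)
        self.abs_head = nn.Linear(d, 4)             # absolute position (bounding box), assumed 4-d
        self.rel_head = nn.Linear(d, 2)             # relative position: center displacement, assumed 2-d

    def forward(self, H):
        # H: (N, T, d) per-instance feature sequences; T must exceed t_obs for at least one prediction.
        N, T, d = H.shape
        abs_preds, rel_preds = [], []
        for t in range(self.t_obs, T):
            window = H[:, t - self.t_obs:t]          # the T_obs features observed before step t
            out, _ = self.rnn(window)                # model the temporal structure hidden in the data
            state = out[:, -1]                       # final RNN output taken as the predicted state for step t
            abs_preds.append(self.abs_head(state))   # (N, 4) absolute positions
            rel_preds.append(self.rel_head(state))   # (N, 2) center displacements
        return torch.stack(abs_preds, dim=1), torch.stack(rel_preds, dim=1)  # (N, T - t_obs, ·)
```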
The action category prediction mainly comprises the following steps:
Step 9) Pool the spatio-temporal features Z of the N instances obtained in step 6) as the video-level representation, and obtain the class probabilities via softmax; the class with the highest probability is the class to which the video belongs (a sketch of this step is given below).
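A short sketch of this action category prediction step, assuming average pooling over the instances (the patent says only "pooling") and a single linear classification layer; num_classes and the layer width are placeholders.

```python
import torch
import torch.nn as nn

class ActionClassifier(nn.Module):
    def __init__(self, d, num_classes):
        super().__init__()
        self.fc = nn.Linear(d, num_classes)

    def forward(self, Z):
        # Z: (N, d) spatio-temporal features of the N instances from step 6)
        video_feat = Z.mean(dim=0)                          # pool instance features into a video-level representation
        probs = torch.softmax(self.fc(video_feat), dim=-1)  # class probabilities
        return probs.argmax(dim=-1), probs                  # the class with the highest probability is the prediction
```

Max pooling or an attention-weighted pooling would also fit the patent's wording; mean pooling is used here only for concreteness.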

Claims (8)

1. A combined action recognition method based on multi-level feature interactive fusion is characterized by comprising the following steps:
extracting position-to-appearance features: extracting instance-centric joint features from low-level appearance information based on instance position information;
performing semantic feature interaction to obtain the semantic interaction between the joint features and the instance identities;
performing semantic-to-position prediction, remapping the semantic features to the position space of the original dimension for instance position prediction;
and performing action category prediction, aggregating the instance-centric features for combined action recognition.
2. The combined action recognition method based on multi-level feature interactive fusion according to claim 1, wherein extracting the instance-centric joint features from the low-level appearance information based on the instance position information specifically comprises:
sampling T frames from the video sequence and extracting their spatio-temporal representation with a deep convolutional network, obtaining a feature map with spatial size H × W and d_fea channels, where H and W are the width and height of the sampled frames, respectively, and d_fea is the number of channels;
according to the coordinate trajectories, cropping and scaling the obtained feature map with RoIAlign to obtain the appearance feature F_app of each instance;
mapping the instance position coordinates and embedding the instance identities into high-dimensional vectors respectively, concatenating them, and passing the result through a multilayer perceptron to obtain the non-appearance feature F_non_app of the object, the instance identity being a code C indicating whether the instance is a person or an object;
concatenating the appearance feature F_app and the non-appearance feature F_non_app to obtain the instance-centric joint feature F.
3. The method of claim 2, wherein the high-dimensional vectors are 512-dimensional.
4. The combined action recognition method based on multi-level feature interactive fusion according to claim 1, wherein obtaining the semantic interaction between the joint features and the instance identities specifically comprises:
performing spatial information transfer between instance nodes: given a time step t, information is transferred between every pair of nodes in space through an interaction function; according to the instance identities, three interaction functions ε_ss, ε_so and ε_oo are constructed, corresponding to person-person, person-object and object-object instance pairs respectively; the spatial feature G_i^t of the i-th instance at time t is obtained by applying the interaction function matching each pair's identities to the concatenation of the pair's joint features and aggregating the results, where F̂_i^t and G_i^t denote the joint feature and the spatial feature of the i-th instance at time t respectively, and [,] denotes the concatenation operation;
performing temporal information transfer between instance nodes: given the spatial features G_i of the i-th instance, temporal dependencies are further captured by an RNN and the instance features are fused along the temporal dimension, i.e. Z_i = ψ_T(Cat(RNN(G_i; θ_T))), where Cat(·) concatenates all the RNN outputs along the last dimension, ψ_T(·) is an MLP that encodes each concatenated instance feature, and θ_T denotes the learnable parameters of the RNN; for each video, the spatio-temporal features of the N instances, Z = {Z_1, Z_2, ..., Z_N}, are used for the final action classification.
5. The method of claim 4, wherein the joint features F̂_i^t are obtained using MLPs.
6. The combined action recognition method based on multi-level feature interactive fusion according to claim 1, wherein remapping the semantic features back to the position space of the original dimension for instance position prediction specifically comprises the following steps:
given the features of the i-th instance observed up to time t, predicting its state at time t+1 with an RNN from the T_obs most recently observed features, where T_obs denotes the number of frames observed by the predictor at each step; the RNN is used to model the temporal structure hidden in the data, and its final output is taken as the predicted state;
given the predicted state of the i-th instance at time t+1, estimating its absolute position and its relative position (the displacement of the instance's center between two consecutive frames) by two linear layers; when t < T_obs, the corresponding values are padded with ground-truth values.
7. The combined action recognition method based on multi-level feature interactive fusion according to claim 1, wherein aggregating the instance-centric features for combined action recognition specifically comprises:
pooling the spatio-temporal features Z of the N instances as the video-level representation, and obtaining the class probabilities via softmax, the class with the highest probability being the class to which the video belongs.
8. A combined action recognition system based on multi-level feature interactive fusion, characterized by comprising a feature extraction module, a semantic feature interaction module, a semantic-to-position prediction module and an action category prediction module, wherein:
the feature extraction module performs position-to-appearance feature extraction, extracting instance-centric joint features from low-level appearance information based on instance position information;
the semantic feature interaction module performs semantic feature interaction to obtain the semantic interaction between the joint features and the instance identities;
the semantic-to-position prediction module performs semantic-to-position prediction, remapping the semantic features to the position space of the original dimension for instance position prediction;
the action category prediction module performs action category prediction, aggregating the instance-centric features for combined action recognition.
CN202111639981.0A 2021-12-29 2021-12-29 Combined action recognition method and system based on multi-level feature interactive fusion Pending CN114333057A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111639981.0A CN114333057A (en) 2021-12-29 2021-12-29 Combined action recognition method and system based on multi-level feature interactive fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111639981.0A CN114333057A (en) 2021-12-29 2021-12-29 Combined action recognition method and system based on multi-level feature interactive fusion

Publications (1)

Publication Number Publication Date
CN114333057A true CN114333057A (en) 2022-04-12

Family

ID=81017659

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111639981.0A Pending CN114333057A (en) 2021-12-29 2021-12-29 Combined action recognition method and system based on multi-level feature interactive fusion

Country Status (1)

Country Link
CN (1) CN114333057A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114677633A (en) * 2022-05-26 2022-06-28 之江实验室 Multi-component feature fusion-based pedestrian detection multi-target tracking system and method


Similar Documents

Publication Publication Date Title
CN108388900B (en) Video description method based on combination of multi-feature fusion and space-time attention mechanism
CN109711463B (en) Attention-based important object detection method
KR102592551B1 (en) Object recognition processing apparatus and method for ar device
Zeng et al. A hierarchical spatio-temporal graph convolutional neural network for anomaly detection in videos
EP3779775B1 (en) Media processing method and related apparatus
CN112861575A (en) Pedestrian structuring method, device, equipment and storage medium
CN114973049B (en) Lightweight video classification method with unified convolution and self-attention
CN109766918B (en) Salient object detection method based on multilevel context information fusion
CN111523378A (en) Human behavior prediction method based on deep learning
Lu et al. Monet: Motion-based point cloud prediction network
CN112270246B (en) Video behavior recognition method and device, storage medium and electronic equipment
CN112184780A (en) Moving object instance segmentation method
CN114333057A (en) Combined action recognition method and system based on multi-level feature interactive fusion
CN115188066A (en) Moving target detection system and method based on cooperative attention and multi-scale fusion
Liu et al. Online human action recognition with spatial and temporal skeleton features using a distributed camera network
CN113627349B (en) Dynamic facial expression recognition method based on self-attention transformation network
CN115204367A (en) Mutual encoder model for classification
CN114842411A (en) Group behavior identification method based on complementary space-time information modeling
Roy et al. Learning spatial-temporal graphs for active speaker detection
CN114120076A (en) Cross-view video gait recognition method based on gait motion estimation
Debnath et al. Attentional learn-able pooling for human activity recognition
CN113780091B (en) Video emotion recognition method based on body posture change representation
Yun et al. Background memory‐assisted zero‐shot video object segmentation for unmanned aerial and ground vehicles
Hamandi Modeling and Enhancing Deep Learning Accuracy in Computer Vision Applications
Braović et al. A brief overview of methodologies and applications in visual Internet of Things

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination