CN114333057A - Combined action recognition method and system based on multi-level feature interactive fusion - Google Patents
- Publication number
- CN114333057A (application number CN202111639981.0A)
- Authority
- CN
- China
- Prior art keywords
- semantic
- feature
- features
- instance
- interaction
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Image Analysis (AREA)
Abstract
The invention discloses a combined action recognition method and system based on multi-level feature interactive fusion. The method comprises the following steps: position-to-appearance feature extraction, which extracts instance-centered joint features from low-level appearance information by using position information; semantic feature interaction, which further explores the semantic interaction between the joint features and the instance identities; semantic-to-position prediction, which maps the semantic features back to a low-dimensional position space to predict instance positions; and action category prediction, which aggregates the instance-centered features to recognize the action. The method can effectively fuse features from different sources and improve the accuracy of combined action recognition.
Description
Technical Field
The invention relates to the technical field of action recognition in computer vision, and in particular to a combined action recognition method and system based on multi-level feature interactive fusion.
Background
Human action recognition aims at understanding human actions from a given video sequence. In recent years, many deep learning methods have been applied in this field, typically relying on a strong backbone network to extract video features for recognition. Due to the complexity of actions and the limited number of sampled frames, the features extracted by such methods often carry a pronounced appearance bias, i.e., they associate an action with a particular object or scene. This is effective for recognizing simple actions that depend on scene information, but is insufficient for actions with a complex temporal structure.
When recognizing actions, humans attend not only to scene information but also to how the distances between objects in the scene change over time. This ability allows a human to easily recognize unseen combinations of actions and objects, and this idea of combined action recognition is introduced into the model design. Such combined reasoning requires not only picture features that represent the scene, but also information from other inputs such as object positions. Since features at different levels differ greatly in modality and dimension, how to effectively fuse multi-level features is the main problem of combined action recognition.
Disclosure of Invention
The invention aims to provide a combined action recognition method and system based on multi-level feature interactive fusion, which fuse information from different sources in an interactive manner and improve the accuracy of combined action recognition.
The technical solution for realizing the purpose of the invention is as follows: a combined action recognition method based on multi-level feature interactive fusion comprises the following steps:
Step 1: position-to-appearance feature extraction, which extracts instance-centered joint features from low-level appearance information by using position information;
Step 2: semantic feature interaction, which further explores the semantic interaction between the joint features and the instance identities;
Step 3: semantic-to-position prediction, which maps the semantic features back to a low-dimensional position space to predict instance positions;
Step 4: action category prediction, which aggregates the instance-centered features to recognize the action.
A combined action recognition system based on multi-level feature interactive fusion comprises a feature extraction module, a semantic feature interaction module, a semantic-to-position prediction module and an action category prediction module, wherein:
the feature extraction module performs position-to-appearance feature extraction, extracting instance-centered joint features from low-level appearance information based on instance position information;
the semantic feature interaction module performs semantic feature interaction to obtain the semantic interaction between the joint features and the instance identities;
the semantic-to-position prediction module performs semantic-to-position prediction, remapping the semantic features to the position space of the original dimension for instance position prediction;
the action category prediction module performs action category prediction, aggregating the instance-centered features to carry out combined action recognition.
Compared with the prior art, the invention has the following beneficial effects:
(1) a simple and unified combined action recognition system is provided, in which different information sources (including appearance, position and semantic features) undergo multi-stage interactive fusion, promoting the fusion of multi-modal information;
(2) through an auxiliary position prediction task, the model is forced to pay more attention to latent motion cues, promoting its modeling of temporal information;
(3) the method explicitly models non-appearance features, improving the model's generalization to the appearance of scenes and objects.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Fig. 2 is a system framework diagram of the present invention.
Detailed Description
The invention is described in further detail below with reference to the accompanying drawings:
With reference to Fig. 1 and Fig. 2, the combined action recognition method based on multi-level feature interactive fusion includes four processes: position-to-appearance feature extraction, semantic feature interaction, semantic-to-position prediction, and action category prediction.
The position-to-appearance feature extraction includes the steps of:
Step 1) Sample T frames from the video sequence and extract a spatio-temporal appearance representation with the I3D backbone network, obtaining a feature map of dimension $T \times H \times W \times d_{fea}$, where H and W are the width and height of the sampled pictures and $d_{fea}$ is the number of channels;
Step 2) According to the coordinate trajectories, crop and scale the obtained feature map with RoIAlign to obtain the appearance feature $F_{app}$ of each instance;
Step 3) Map the instance coordinates and embed the instance identities into high-dimensional vectors respectively, concatenate them, and pass them through a multi-layer perceptron to obtain the non-appearance feature $F_{non\_app}$ of the object. The instance identity refers to the code C indicating whether the instance is a person or an object;
Step 4) Concatenate $F_{app}$ and $F_{non\_app}$ to obtain an instance-centered joint representation F of the video.
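Steps 1) to 4) can be illustrated with a minimal numpy sketch. This is not the patent's implementation: `crop_pool` is a crude stand-in for RoIAlign, and all dimensions, weights, and helper names are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def crop_pool(feat_map, box):
    """Crude stand-in for RoIAlign: crop the box region from an
    (H, W, C) feature map and average-pool it to a single vector."""
    H, W, _ = feat_map.shape
    x1, y1, x2, y2 = box
    r0, r1 = int(y1 * H), max(int(y2 * H), int(y1 * H) + 1)
    c0, c1 = int(x1 * W), max(int(x2 * W), int(x1 * W) + 1)
    return feat_map[r0:r1, c0:c1].mean(axis=(0, 1))

def mlp(x, w1, w2):
    # Two-layer perceptron with ReLU, standing in for the patent's MLP.
    return np.maximum(x @ w1, 0.0) @ w2

C, d_emb = 8, 16                         # toy channel / embedding sizes (hypothetical)
feat_map = rng.normal(size=(7, 7, C))    # one frame of the backbone feature map
box = (0.1, 0.2, 0.5, 0.6)               # normalized (x1, y1, x2, y2) instance box
identity = np.array([1.0, 0.0])          # one-hot identity code C: person vs. object

F_app = crop_pool(feat_map, box)         # appearance feature F_app

# Non-appearance feature F_non_app: concatenate coords + identity, pass through an MLP.
w1 = rng.normal(size=(6, d_emb))
w2 = rng.normal(size=(d_emb, d_emb))
F_non_app = mlp(np.concatenate([np.asarray(box), identity]), w1, w2)

# Step 4): concatenate into the instance-centered joint representation F.
F = np.concatenate([F_app, F_non_app])
print(F.shape)  # (24,)
```

Per instance and per frame, the joint feature is simply the concatenation of a pooled appearance vector and an embedded position/identity vector, which is what later stages consume.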
The semantic feature interaction comprises the following steps:
Step 5) Spatial information transfer between instance nodes. Given a time step t, information is passed between every pair of nodes in space through an interaction function. According to the instance identities, three interaction functions are constructed, $f_{ss}$, $f_{so}$ and $f_{oo}$, corresponding to the person-person pair set $\varepsilon_{ss}$, the person-object pair set $\varepsilon_{so}$, and the object-object pair set $\varepsilon_{oo}$. The spatial interaction function is:

$$G_i^t = \sum_{(i,j)\in\varepsilon_{ss}} f_{ss}([F_i^t, F_j^t]) + \sum_{(i,j)\in\varepsilon_{so}} f_{so}([F_i^t, F_j^t]) + \sum_{(i,j)\in\varepsilon_{oo}} f_{oo}([F_i^t, F_j^t])$$

where $F_i^t$ and $G_i^t$ respectively denote the basic joint feature and the spatial feature of the i-th instance at time t, $[\cdot,\cdot]$ denotes the concatenation operation, and the interaction functions may be implemented as MLPs;
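The identity-conditioned spatial message passing of step 5) can be sketched in numpy as follows. The dimensions, the single-layer ReLU "MLPs", and the helper names (`make_mlp`, `spatial_interaction`) are illustrative assumptions, not from the patent.

```python
import numpy as np

rng = np.random.default_rng(1)

d = 8
ids = ['s', 'o', 'o']                     # instance identities: person 's', object 'o'
F_t = rng.normal(size=(len(ids), d))      # basic joint features F_i^t at one time step

def make_mlp(d_in, d_out):
    w = rng.normal(size=(d_in, d_out)) * 0.1
    return lambda x: np.maximum(x @ w, 0.0)

# One interaction function per identity pair: f_ss, f_so, f_oo.
f_pair = {k: make_mlp(2 * d, d) for k in ('ss', 'so', 'oo')}
canon = {'ss': 'ss', 'so': 'so', 'os': 'so', 'oo': 'oo'}

def spatial_interaction(F_t, ids):
    """G_i^t: sum of identity-conditioned messages f([F_i^t, F_j^t]) over pairs."""
    N = len(ids)
    G_t = np.zeros_like(F_t)
    for i in range(N):
        for j in range(N):
            if i != j:
                f = f_pair[canon[ids[i] + ids[j]]]
                G_t[i] += f(np.concatenate([F_t[i], F_t[j]]))
    return G_t

G_t = spatial_interaction(F_t, ids)
print(G_t.shape)  # (3, 8)
```

Each node's spatial feature aggregates one message per neighbor, with the message function selected by whether the pair is person-person, person-object, or object-object.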
Step 6) Temporal information transfer between instance nodes. Given the spatial features $G_i = \{G_i^1, \ldots, G_i^T\}$ of the i-th instance, temporal dependencies are further captured by an RNN and the instance features are fused along the temporal dimension:

$$Z_i = \psi_T\big(\mathrm{Cat}(\mathrm{RNN}(G_i; \theta_T))\big)$$

where $\mathrm{Cat}(\cdot)$ concatenates all the RNN outputs along the last dimension, $\psi_T$ is an MLP that encodes each concatenated instance feature, and $\theta_T$ denotes the learnable parameters of the RNN. For each video, the spatio-temporal features of the N instances, $Z = \{Z_1, Z_2, \ldots, Z_N\}$, are used for the final action classification.
the semantic to position prediction mainly comprises the following steps:
Step 7) Predict the future state of each instance from the observed features. Given the $T_{obs}$ most recent observed features of the i-th instance, the (t+1)-th state is predicted as:

$$\hat{Z}_i^{t+1} = \mathrm{RNN}\big(Z_i^{t-T_{obs}+1}, \ldots, Z_i^{t}; \theta_T\big)$$

where $T_{obs}$ is the number of frames observed by the predictor in each step. The RNN models the temporal structure hidden in the data, and its final output is treated as the predicted state $\hat{Z}_i^{t+1}$. This RNN shares parameters with the RNN in step 6);
Step 8) Given the predicted state $\hat{Z}_i^{t+1}$ of the i-th instance at time t+1, its absolute position and relative position (the difference between the instance centers in two consecutive frames) are estimated by two linear layers:

$$\hat{p}_i^{t+1} = \mathrm{Linear}_p(\hat{Z}_i^{t+1}), \qquad \hat{\delta}_i^{t+1} = \mathrm{Linear}_\delta(\hat{Z}_i^{t+1})$$
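Steps 7) and 8) together can be sketched as follows: an RNN consumes the last T_obs observed features and its final hidden state is mapped to position estimates by two linear heads. The 2-D centers, the vanilla tanh RNN, and all weight shapes are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

d, T, T_obs = 8, 6, 3
Z_obs = rng.normal(size=(T, d))      # observed per-frame features of one instance

W_x = rng.normal(size=(d, d)) * 0.1
W_h = rng.normal(size=(d, d)) * 0.1

def rnn_last(seq):
    """Run a vanilla tanh RNN over seq and return the final hidden state,
    treated as the predicted state for the next frame."""
    h = np.zeros(d)
    for x in seq:
        h = np.tanh(x @ W_x + h @ W_h)
    return h

z_hat = rnn_last(Z_obs[-T_obs:])     # predicted state from the last T_obs frames

# Two linear heads estimate the absolute center and the center offset
# between consecutive frames (2-D positions assumed for illustration).
W_p = rng.normal(size=(d, 2))
W_r = rng.normal(size=(d, 2))
p_abs, p_rel = z_hat @ W_p, z_hat @ W_r
print(p_abs.shape, p_rel.shape)  # (2,) (2,)
```

Because this auxiliary head must recover positions from the semantic features, it pressures the model to keep motion information in those features rather than relying on appearance alone.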
The action category prediction mainly comprises the following steps:
Step 9) Pool the spatio-temporal features Z of the N instances obtained in step 6) as the video-level representation, and obtain the class probabilities through softmax; the class with the highest probability is the class to which the video belongs.
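Step 9) amounts to pooling instance features and classifying, which can be sketched as below. Mean pooling and a single linear classifier are assumptions; the patent only specifies pooling followed by softmax.

```python
import numpy as np

rng = np.random.default_rng(4)

N, d, n_classes = 4, 8, 5
Z = rng.normal(size=(N, d))              # spatio-temporal features of the N instances

video_feat = Z.mean(axis=0)              # pool instance features into one video-level vector

W_cls = rng.normal(size=(d, n_classes))  # classifier head (hypothetical linear layer)
logits = video_feat @ W_cls

probs = np.exp(logits - logits.max())    # numerically stable softmax
probs /= probs.sum()
pred = int(np.argmax(probs))             # class with the highest probability
print(probs.shape, pred in range(n_classes))  # (5,) True
```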
Claims (8)
1. A combined action recognition method based on multi-level feature interactive fusion, characterized by comprising the following steps:
performing position-to-appearance feature extraction, extracting instance-centered joint features from low-level appearance information based on instance position information;
performing semantic feature interaction to obtain the semantic interaction between the joint features and the instance identities;
performing semantic-to-position prediction, remapping the semantic features to the position space of the original dimension for instance position prediction;
performing action category prediction, aggregating the instance-centered features to carry out combined action recognition.
2. The combined action recognition method based on multi-level feature interactive fusion according to claim 1, wherein extracting instance-centered joint features from low-level appearance information based on instance position information specifically comprises:
sampling T frames from the video sequence and extracting their spatio-temporal representation with a deep convolutional network, obtaining a feature map of dimension $T \times H \times W \times d_{fea}$, where H and W are respectively the width and height of the sampled pictures and $d_{fea}$ is the number of channels;
according to the coordinate trajectories, cropping and scaling the obtained feature map with RoIAlign to obtain the appearance feature $F_{app}$ of each instance;
mapping the instance position coordinates and embedding the instance identities into high-dimensional vectors respectively, concatenating them, and passing them through a multi-layer perceptron to obtain the non-appearance feature $F_{non\_app}$ of the object, the instance identity referring to the code C indicating whether the instance is a person or an object;
concatenating the appearance feature $F_{app}$ and the non-appearance feature $F_{non\_app}$ to obtain an instance-centered joint feature F.
3. The method according to claim 2, wherein the high-dimensional vectors are 512-dimensional.
4. The combined action recognition method based on multi-level feature interactive fusion according to claim 1, wherein obtaining the semantic interaction between the joint features and the instance identities specifically comprises:
performing spatial information transfer between instance nodes: given a time step t, information is passed between every pair of nodes in space through an interaction function; according to the instance identities, three interaction functions are constructed, $f_{ss}$, $f_{so}$ and $f_{oo}$, corresponding to the person-person pair set $\varepsilon_{ss}$, the person-object pair set $\varepsilon_{so}$, and the object-object pair set $\varepsilon_{oo}$; the spatial interaction function is:

$$G_i^t = \sum_{(i,j)\in\varepsilon_{ss}} f_{ss}([F_i^t, F_j^t]) + \sum_{(i,j)\in\varepsilon_{so}} f_{so}([F_i^t, F_j^t]) + \sum_{(i,j)\in\varepsilon_{oo}} f_{oo}([F_i^t, F_j^t])$$

where $F_i^t$ and $G_i^t$ respectively denote the joint feature and the spatial feature of the i-th instance at time t, and $[\cdot,\cdot]$ denotes the concatenation operation;
performing temporal information transfer between instance nodes: given the spatial features $G_i$ of the i-th instance, temporal dependencies are further captured by an RNN and the instance features are fused along the temporal dimension:

$$Z_i = \psi_T\big(\mathrm{Cat}(\mathrm{RNN}(G_i; \theta_T))\big)$$

where $\mathrm{Cat}(\cdot)$ concatenates all the RNN outputs along the last dimension, $\psi_T(\cdot)$ is an MLP that encodes each concatenated instance feature, and $\theta_T$ represents the learnable parameters of the RNN; for each video, the spatio-temporal features of the N instances, $Z = \{Z_1, Z_2, \ldots, Z_N\}$, are used for the final action classification.
6. The combined action recognition method based on multi-level feature interactive fusion according to claim 1, wherein remapping the semantic features to the position space of the original dimension for instance position prediction specifically comprises the following steps:
predicting the future state of each instance from the observed features:

$$\hat{Z}_i^{t+1} = \mathrm{RNN}\big(Z_i^{t-T_{obs}+1}, \ldots, Z_i^{t}; \theta_T\big)$$

where $T_{obs}$ represents the number of frames observed by the predictor in each step; the RNN models the temporal structure hidden in the data, and its final output is treated as the predicted state $\hat{Z}_i^{t+1}$;
given the predicted state $\hat{Z}_i^{t+1}$ of the i-th instance at time t+1, estimating its absolute position $\hat{p}_i^{t+1}$ and relative position $\hat{\delta}_i^{t+1}$ by two linear layers:

$$\hat{p}_i^{t+1} = \mathrm{Linear}_p(\hat{Z}_i^{t+1}), \qquad \hat{\delta}_i^{t+1} = \mathrm{Linear}_\delta(\hat{Z}_i^{t+1})$$
7. The combined action recognition method based on multi-level feature interactive fusion according to claim 1, wherein aggregating the instance-centered features to carry out combined action recognition specifically comprises:
pooling the spatio-temporal features Z of the N instances as the video-level representation, and obtaining the class probabilities through softmax, the class with the highest probability being the class to which the video belongs.
8. A combined action recognition system based on multi-level feature interactive fusion, characterized by comprising a feature extraction module, a semantic feature interaction module, a semantic-to-position prediction module and an action category prediction module, wherein:
the feature extraction module performs position-to-appearance feature extraction, extracting instance-centered joint features from low-level appearance information based on instance position information;
the semantic feature interaction module performs semantic feature interaction to obtain the semantic interaction between the joint features and the instance identities;
the semantic-to-position prediction module performs semantic-to-position prediction, remapping the semantic features to the position space of the original dimension for instance position prediction;
the action category prediction module performs action category prediction, aggregating the instance-centered features to carry out combined action recognition.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111639981.0A CN114333057A (en) | 2021-12-29 | 2021-12-29 | Combined action recognition method and system based on multi-level feature interactive fusion |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114333057A true CN114333057A (en) | 2022-04-12 |
Family
ID=81017659
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114677633A (en) * | 2022-05-26 | 2022-06-28 | 之江实验室 | Multi-component feature fusion-based pedestrian detection multi-target tracking system and method |
Legal Events

Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |