CN111241963B - First person view video interactive behavior identification method based on interactive modeling - Google Patents

First person view video interactive behavior identification method based on interactive modeling

Info

Publication number
CN111241963B
Authority
CN
China
Prior art keywords
motion
module
interactors
video
interactor
Prior art date
Legal status
Active
Application number
CN202010009544.XA
Other languages
Chinese (zh)
Other versions
CN111241963A (en)
Inventor
郑伟诗
蔡祎俊
李昊昕
陈立
Current Assignee
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN202010009544.XA priority Critical patent/CN111241963B/en
Publication of CN111241963A publication Critical patent/CN111241963A/en
Application granted granted Critical
Publication of CN111241963B publication Critical patent/CN111241963B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Abstract

The invention discloses a first-person-view video interaction behavior recognition method based on interaction modeling, which separates the camera wearer from the interactor, learns the corresponding static appearance and dynamic motion features of each, and then explicitly models the interaction relationship between the two. To separate the interactor from the background, a mask is generated with an attention model, and a human body analysis model is used to assist the learning of the attention model; a motion module is provided to predict the motion information matrices corresponding to the camera wearer and the interactor respectively, and the learning of the motion module is assisted by reconstructing the next frame. Finally, a dual long-short-time memory module for interaction modeling is provided, on the basis of which the interaction relationship is explicitly modeled. The invention can well describe and recognize first-person-view interaction behavior and obtains leading recognition results on common first-person-view interaction behavior research datasets.

Description

First person view video interactive behavior identification method based on interactive modeling
Technical Field
The invention belongs to the technical field of behavior recognition, and particularly relates to a first person view video interactive behavior recognition method based on interactive modeling.
Background
At present, mainstream first-person interaction behavior recognition methods fall into two categories. One uses hand-crafted motion features such as motion trajectories and optical flow, combined with traditional classifiers such as support vector machines; the other performs feature learning through deep learning, adopting models similar to those used for third-person video behavior recognition, and uses convolutional neural networks and long-short-time memory models to learn behavior features directly from video frames.
The main shortcoming of the above prior art is the lack of explicit modeling of the interaction relationship between the camera wearer and the interactor. The prior art generally learns the overall features of the interaction behavior directly; however, first-person-view interaction behavior arises from the interplay between the camera wearer and the interactor, and such behavior can be better described by explicitly modeling this interaction relationship. Because the prior art lacks explicit interaction modeling, it cannot describe the interaction behavior well.
Disclosure of Invention
The invention aims to overcome the defects and shortcomings of the prior art and provides a first-person visual angle video interactive behavior recognition method based on interactive modeling.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
the first person view video interactive behavior identification method based on interactive modeling comprises the following steps:
s1, explicitly separating a camera wearer and an interactor, respectively learning behavior characteristics of the camera wearer and the interactor, and comprising the following steps:
s1.1, separating an interactor from the background through an attention module;
s1.2, respectively extracting and learning the behavior features of the camera wearer and the interactor, including static appearance features and dynamic motion features; the static appearance feature is the feature of the static visual content seen by the camera wearer, namely the global appearance feature of the video frame I_t corresponding to the camera wearer and the local appearance feature of the video frame I_t corresponding to the interactor;
s1.3, motion feature learning: for the camera wearer, its motion information is the camera motion information, and the influence of this motion information on video frame changes is global; for the interactor, the influence of its motion information on video frame changes is local, and is represented by a dense motion matrix D ∈ R^{H×W}, which is multiplied point by point with the mask M_t^{(3)} generated by the attention module so that the motion matrix D acts only on the interactor and not on the background;
s1.4, for each pair of adjacent video frames I_{t-1}, I_t, the global static appearance feature f_t^{g,a} and motion feature f_t^{g,m} corresponding to the camera wearer and the local static appearance feature f_t^{l,a} and motion feature f_t^{l,m} corresponding to the interactor are obtained through the attention module and the motion module respectively. The behavior feature of the camera wearer is defined as f_t^{ego} = [f_t^{g,a}, f_t^{g,m}] and that of the interactor as f_t^{exo} = [f_t^{l,a}, f_t^{l,m}]; these two features are used to model the interaction relationship between the camera wearer and the interactor.
S2, modeling a dual interaction relationship;
s2.1, constructing a long-short-time memory module for interactive modeling;
s2.2, the long-short-time memory module for interactive modeling explicitly models the interaction relationship between the camera wearer and the interactor by taking the output of the dual module at the previous frame as an input of the current frame.
As a preferred technical solution, in step S1.1, the attention module specifically includes:
for two adjacent frames I_{t-1}, I_t ∈ R^{H×W×3} of the video, where t is the frame index, H and W are respectively the height and width of the video frame, and 3 is the number of channels of the video frame, representing the three RGB channels, a deep convolutional neural network is used to extract features from each frame, obtaining the visual features f_{t-1}, f_t ∈ R^{H0×W0×C} corresponding to the two frames, where H0 and W0 are respectively the height and width of the feature map and C is the number of feature channels; these features extract the static appearance information in the video. The attention module adds a series of deconvolution layers on the visual feature f_t to obtain a group of masks of different sizes M_t^{(0)}, M_t^{(1)}, M_t^{(2)}, M_t^{(3)}, where the size of M_t^{(0)} is H0×W0, i.e. equal to that of the feature map f_t, and the size of M_t^{(3)} is H×W, i.e. equal to that of the video frame I_t; M_t^{(0)} is used to separate the appearance features of the interactor from the feature f_t, and M_t^{(3)} is used in the subsequent motion module.
As a preferred technical scheme, the method further comprises the following steps:
the human body analysis model and a corresponding human body segmentation loss are introduced to assist the learning of the attention module, specifically comprising the following steps:
the existing human body analysis model JPPNet is used to generate a corresponding reference mask M_t^{RF} for the video frame I_t, and the human body segmentation loss corresponding to the attention model is:
[Formula for the human body segmentation loss L_seg, comparing the generated masks with the reference mask M_t^{RF}; presented as an image in the original document.]
as a preferred technical solution, in step S1.2, the static appearance feature is extracted by:
based on the mask M_t^{(0)} generated by the attention module, the static appearance features of the camera wearer and the interactor are extracted respectively; the camera wearer is not visible in the first-person video, so its static appearance feature is defined as the feature of the static visual content seen by the camera wearer, namely the global appearance feature of the video frame I_t:
f_t^{g,a} = (1 / (H0 · W0)) Σ_{i,j} f_t(i, j)
where i, j are the two-dimensional coordinate indices on the feature map f_t, and the global appearance feature f_t^{g,a} is a vector of dimension C;
for the interactor, the mask M_t^{(0)} generated by the attention module gives the spatial position information of the interactor, and the static appearance feature of the interactor is obtained from the appearance feature f_t of the video frame and the mask M_t^{(0)} as:
f_t^{l,a} = Σ_{i,j} M_t^{(0)}(i, j) f_t(i, j) / Σ_{i,j} M_t^{(0)}(i, j)
as a preferable technical solution, in step S1.3, the dynamic characteristics are:
for the camera wearer, a global transformation matrix T ∈ R^{3×3} is used to describe the global motion information; each pixel position (x, y) in the video frame is extended to a three-dimensional vector p = (x, y, 1)^T and then transformed as T·p with the global transformation matrix T, the transformation result being equivalent to rotating, scaling and translating the video frame;
for the interactor, the influence of its motion information on video frame changes is local and is represented by a dense motion matrix D ∈ R^{H×W}, which is multiplied point by point with the mask M_t^{(3)} generated by the attention module so that the motion matrix D acts only on the interactor and not on the background; given the global transformation matrix T, the local motion matrix D and the coordinate matrix X_{t-1} of frame t-1, the coordinate matrix X_t of frame t is predicted as:
X_t = T · X_{t-1} + M_t^{(3)} ⊙ D
using the predicted coordinate matrix X_t of frame t, the video frame Î_t of frame t is reconstructed by interpolation, and T and D are learned through the reconstruction loss between the reconstructed video frame and the real video frame:
L_rec = Σ_x ‖ Î_t(x) − I_t(x) ‖
where x is the index of each two-dimensional spatial coordinate.
As a preferable technical scheme, the global transformation matrix T and the local motion matrix D are generated by a motion module. The motion module takes the visual features f_{t-1}, f_t of frame t-1 and frame t as input; f_{t-1} and f_t are compared multiplicatively to compute their correlation, and the result is concatenated with f_{t-1} and f_t along the channel dimension to obtain a new feature map. A convolution layer is added on this feature map, and the result is fed into two branches: one branch is globally pooled to obtain the global motion feature f_t^{g,m} corresponding to the camera wearer, and a fully connected layer is added on this global feature to obtain the global transformation matrix T; the other branch is multiplied point by point with the mask M_t^{(0)} generated by the attention module to obtain a feature, to which three deconvolution layers and one convolution layer are added to obtain the local motion matrix D matching the size of the video frame, and the feature before the deconvolution layers is globally pooled to obtain the local motion feature f_t^{l,m} corresponding to the interactor.
As an preferable technical solution, in step S2.1, the long-short-time memory module for interactive modeling is specifically constructed as follows:
the individual behavior characteristics of the camera wearer and the interactor are respectively input into corresponding long-short-time memory modules, the two modules are dual modules, and a symmetrical updating mode is adopted:
[i_t; o_t; g_t; a_t] = σ(W f_t + U F_{t-1} + J_{t-1} + b)
J_{t-1} = V F_{t-1}^*
c_t = i_t a_t + g_t c_{t-1}
F_t = o_t tanh(c_t)
where i_t, o_t, g_t, a_t are respectively the input gate, output gate, forget gate and input feature of the long-short-time memory module, σ is the sigmoid nonlinear function, φ is the linear rectification function, f_t is the individual behavior feature of the camera wearer or the interactor, c_t is the intermediate feature of the long-short-time memory module, F_t is the output feature of the corresponding long-short-time memory module, and F_t^* is the output feature of the dual module.
As a preferred technical solution, in step S2.2, the method further includes the following steps:
the outputs of the two long-short-time memory modules at the last frame N of the video are added point by point, and the fused feature is obtained through a nonlinear operation:
R_N = φ(F_N + F_N^*)
a linear classifier is added on R_N, and the probability corresponding to each behavior category is obtained through a softmax function:
p(y|R_N) = softmax(W R_N + b)
optimizing the classification result using a cross entropy loss function:
L_cls = − Σ_{k=1}^{K} y_k log p(y_k | R_N)
where y_k is the label of category k, i.e. y_k = 1 if the behavior category index is k and y_k = 0 otherwise; K is the total number of categories.
Compared with the prior art, the invention has the following advantages and beneficial effects:
the invention provides a camera wearer and an interactor which are explicitly separated and respectively learn the characteristics of the camera wearer and the interactor, and the interactor is separated from the background through an attention module, so that static appearance characteristics respectively corresponding to the camera wearer and the interactor are obtained, and the movement characteristics respectively corresponding to the camera wearer and the interactor are learned through a movement module. Based on the behavior characteristics of the two, the invention further provides a long-short-time memory module for modeling the explicit interaction relationship, so that the interaction behavior between the camera wearer and the interactors is described and identified. By the technical scheme, different interaction behaviors under the first person view angle can be well identified and classified, and better performance on each research data set is obtained. Therefore, the first person visual angle interaction behavior recognition model based on explicit interaction modeling is an effective interaction behavior recognition model, and can be deployed in an intelligent wearable device system, so that the intelligent system can automatically recognize and process different interaction behaviors.
Drawings
Fig. 1 is a schematic view of the structure of the attention module of the present invention.
Fig. 2 is a schematic diagram of the results of the motion module of the present invention.
Fig. 3 is a schematic diagram of the long-short-time memory module for interactive modeling according to the present invention.
Fig. 4 is a flow chart of the first person view video interactive behavior recognition method based on interactive modeling.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but embodiments of the present invention are not limited thereto.
Examples
The invention addresses the problem that, given a video clip, an intelligent video analysis system needs to identify the category of the behavior of the people in the video. In an intelligent video analysis system based on wearable devices, the camera is worn on a person, the video is captured from the first-person view, and the category of the interaction behavior between the wearer and another person under the first-person view needs to be recognized. Current mainstream first-person-view interaction behavior recognition methods mostly adopt approaches similar to behavior recognition under the third-person view and learn features directly from the static appearance and dynamic motion information of the whole video, without explicitly separating the camera wearer from the interactor who generates the interaction behavior with the wearer, and without modeling the relationship between them. Aiming at the problem of recognizing first-person-view interaction behavior under a single view, the invention proposes to explicitly separate the camera wearer and the interactor, learn their features respectively, and model the relationship between them with a dual relationship model. To explicitly separate the camera wearer from the interactor, the invention provides an attention module on the features of a deep convolutional neural network and uses a human body analysis algorithm to assist the learning of the attention module, so that the interactor and the surrounding information can be separated from the background. To learn the motion features of the camera wearer and the interactor respectively, the invention further provides a motion information module on the basis of the attention module, which learns a global motion matrix for the camera wearer and a dense local motion matrix for the interactor; the two matrices are applied to a frame in the video to attempt to reconstruct the corresponding next frame, and the two motion matrices and the corresponding motion features are learned through the reconstruction error. The invention further combines the learned motion features with the static appearance features based on the convolutional neural network, and provides a dual long-short-time memory model to model the relationship between the camera wearer and the interactor. By explicitly separating the camera wearer and the interactor, learning their features respectively and modeling the interaction relationship between them, the proposed model can well describe and recognize interaction behavior under the first-person view, and obtains the current best results on common first-person-view interaction behavior recognition datasets.
As shown in fig. 4, the first person view video interaction behavior recognition method based on interaction modeling in this embodiment specifically includes the following steps:
s1, individual behavior feature expression:
to explicitly model the interaction relationship between the camera wearer and the interactors, we first need to separate the two and learn the behavior characteristics of the two separately. The method mainly comprises two steps, wherein the first step is to separate an interactor from the background through an attention module, and the second step is to respectively extract and learn the behavior characteristics of a camera wearer and the interactor, including static appearance characteristics and dynamic movement characteristics.
S1.1, an attention module;
the function of the attention module is to separate the interactors from the background in the first person perspective video. For adjacent two of the videos I t-1 ,I t ∈R H x W x 3 Wherein t is the frame number, H and W are the height and width of the video frame respectively, 3 is the channel number of the video frame and represents RGB three channels, we extract the features respectively by a depth convolution neural network ResNet-50 to obtain the visual features f corresponding to the two frames t-1 ,f t ∈R H0 x W0 x C Wherein H0, W0 are the height and width of the feature map, respectively, and C is the number of channels characterized. These features extract static appearance information in the video. The attention module provided by the invention is characterized in visual characteristic f t Adding a series of deconvolution layers to obtain a group of masks M with different sizes t (0) ,M t (1) ,M t (2) ,M t (3) Wherein M is t (0) The size of (a) is H0x W0, i.e. the sum of the features f t Equal in size, and M t (3) Is H x W, i.e. is equal to video frame I t Is the same size. M is M t (0) For slave feature f t Separating the appearance characteristics of the interactors, and M t (3) For use in a subsequent motion profile module.
So that the masks M_t^{(k)} (k = 0, 1, 2, 3) generated by the attention module can separate the interactor from the background, a human body analysis model and a corresponding human body segmentation loss are introduced to assist the learning of the attention module. Specifically, the existing human body analysis model JPPNet is used to generate a corresponding reference mask M_t^{RF} for the video frame I_t, and the human body segmentation loss corresponding to the attention model is:
[Formula for the human body segmentation loss L_seg, comparing the generated masks with the reference mask M_t^{RF}; presented as an image in the original document.]
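As an illustration of how such an attention module could be realized, the following PyTorch sketch stacks deconvolution layers on the frame feature f_t and predicts masks at four scales; the backbone channel count, the 7×7 feature map and the 224×224 frame size are assumptions made for illustration and are not specified in the patent.

# Minimal sketch (not the patented implementation) of an attention module that
# produces masks M_t^(0)..M_t^(3) from the frame feature f_t via deconvolutions.
import torch
import torch.nn as nn

class AttentionModule(nn.Module):
    def __init__(self, in_channels=2048):
        super().__init__()
        # M_t^(0): same spatial size as the feature map f_t
        self.mask0 = nn.Sequential(nn.Conv2d(in_channels, 1, 1), nn.Sigmoid())
        # deconvolution pyramid: 7x7 -> 28x28 -> 112x112 -> 224x224
        self.up = nn.ModuleList([
            nn.ConvTranspose2d(in_channels, 256, 4, stride=4),
            nn.ConvTranspose2d(256, 64, 4, stride=4),
            nn.ConvTranspose2d(64, 16, 2, stride=2),
        ])
        # one mask head per scale for M_t^(1), M_t^(2), M_t^(3)
        self.mask_heads = nn.ModuleList([
            nn.Sequential(nn.Conv2d(c, 1, 1), nn.Sigmoid()) for c in (256, 64, 16)
        ])

    def forward(self, f_t):
        masks = [self.mask0(f_t)]
        x = f_t
        for deconv, head in zip(self.up, self.mask_heads):
            x = torch.relu(deconv(x))
            masks.append(head(x))
        return masks  # [M_t^(0), M_t^(1), M_t^(2), M_t^(3)], M_t^(3) is frame-sized

masks = AttentionModule()(torch.randn(1, 2048, 7, 7))
print([m.shape[-1] for m in masks])  # [7, 28, 112, 224]

In such a sketch, the human body segmentation loss could compare each generated mask with the JPPNet reference mask resized to the matching scale; the exact form of that comparison is not given in the text.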
s1.2, extracting static appearance characteristics;
Based on the mask M_t^{(0)} generated by the attention module, the static appearance features of the camera wearer and the interactor can be extracted respectively. The camera wearer is not visible in the first-person video, so its static appearance feature is defined as the feature of the static visual content seen by the camera wearer, namely the global appearance feature of the video frame I_t:
f_t^{g,a} = (1 / (H0 · W0)) Σ_{i,j} f_t(i, j)
where i, j are the two-dimensional coordinate indices on the feature map f_t, and the global appearance feature f_t^{g,a} is a vector of dimension C.
For the interactor, the mask M_t^{(0)} generated by the attention module gives the spatial position information of the interactor, and the static appearance feature of the interactor can be obtained from the appearance feature f_t of the video frame and the mask M_t^{(0)} as:
f_t^{l,a} = Σ_{i,j} M_t^{(0)}(i, j) f_t(i, j) / Σ_{i,j} M_t^{(0)}(i, j)
this feature describes the local static appearance information of the interactors.
S1.3, learning motion characteristics:
to describe the behavior of a video, it is not enough to rely on static appearance information alone, but it is also necessary to extract dynamic motion information from the video and learn the corresponding motion characteristics for the camera wearer and the interactors, respectively. For the camera wearer, the motion information is the camera motion information, and the influence of the motion information on the video frame change is global. Using oneEach global transformation matrix T epsilon R 3 x 3 This global motion information is described. For each pixel position (x, y) in a video frame, it is extended to a three-dimensional vector p= (x, y, 1) T Then, transform t×p is performed with the global transform matrix T, and the transform result is equivalent to rotating, scaling, and translating the video frame. Since each pixel in the video frame is transformed identically, the motion information corresponding to this transform T is global.
For interactors, the influence of motion information on video frame transformation is local and passes through a dense motion matrix D E R H x W To represent the movement information of the interactors and through a mask M generated by the attention module t (3) The multiplication is gradual, such that the motion matrix D acts only on the interactors and not on the background. Given a global transformation matrix T and a local motion matrix D, and a coordinate matrix X of the T-1 frame t-1 Coordinate matrix X of t frame can be predicted t
X_t = T · X_{t-1} + M_t^{(3)} ⊙ D
Using the predicted coordinate matrix X_t of frame t, the video frame Î_t of frame t can be reconstructed by interpolation, and T and D are learned through the reconstruction loss between the reconstructed video frame and the real video frame:
L_rec = Σ_x ‖ Î_t(x) − I_t(x) ‖
where x is the index of each two-dimensional spatial coordinate.
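The following sketch shows one possible realization of this warping and reconstruction step. It assumes that D is treated as a two-channel (x, y) displacement field, that bilinear sampling implements the interpolation, and that an L1 reconstruction loss is used; none of these details are fixed by the text.

# Sketch of motion-based frame reconstruction: warp the pixel-coordinate grid of
# frame t-1 with the global transform T plus the masked local motion, sample
# frame t-1 at the predicted coordinates, and compare with the real frame t.
import torch
import torch.nn.functional as F

def reconstruct_frame(frame_prev, T, D, m3):
    """frame_prev: (B,3,H,W); T: (B,3,3); D: (B,2,H,W); m3: (B,1,H,W) mask M_t^(3)."""
    B, _, H, W = frame_prev.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij")
    grid = torch.stack([xs, ys, torch.ones_like(xs)], dim=0)       # homogeneous pixel coordinates
    grid = grid.view(1, 3, -1).expand(B, -1, -1).to(frame_prev)
    warped = torch.bmm(T, grid)                                    # global transform applied to every pixel
    warped = warped[:, :2] / warped[:, 2:3].clamp(min=1e-6)        # back to 2-D coordinates
    warped = warped.reshape(B, 2, H, W) + m3 * D                   # add masked local motion of the interactor
    return F.grid_sample(frame_prev, warped.permute(0, 2, 3, 1),   # bilinear interpolation
                         align_corners=True)

def reconstruction_loss(recon, frame_t):
    return (recon - frame_t).abs().sum()                           # L1 reconstruction loss (assumed form)

# Sanity check: identity transform and zero motion reproduce frame t-1.
frame = torch.rand(1, 3, 64, 64)
loss = reconstruction_loss(
    reconstruct_frame(frame, torch.eye(3).unsqueeze(0), torch.zeros(1, 2, 64, 64),
                      torch.ones(1, 1, 64, 64)), frame)
print(round(loss.item(), 4))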
The global transformation matrix T and the local motion matrix D are generated by the motion module. The motion module takes the visual features f_{t-1}, f_t of frame t-1 and frame t as input; f_{t-1} and f_t are compared multiplicatively to compute their correlation, and the result is concatenated with f_{t-1} and f_t along the channel dimension to obtain a new feature map. A convolution layer is added on this feature map, and the result is fed into two branches: one branch is globally pooled to obtain the global motion feature f_t^{g,m} corresponding to the camera wearer, and a fully connected layer is added on this global feature to obtain the global transformation matrix T; the other branch is multiplied point by point with the mask M_t^{(0)} generated by the attention module to obtain a feature, to which three deconvolution layers and one convolution layer are added to obtain the local motion matrix D matching the size of the video frame, and the feature before the deconvolution layers is globally pooled to obtain the local motion feature f_t^{l,m} corresponding to the interactor. The parameters of the motion module are learned through the reconstruction loss described above. Fig. 2 shows a schematic diagram of the result of the motion module.
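A compact sketch of such a motion module follows; the channel widths, the deconvolution schedule and the two-channel form of D are illustrative assumptions rather than values taken from the patent.

# Sketch of the motion module: correlate f_{t-1} and f_t, concatenate, then split
# into a global branch (motion feature f_t^{g,m} and transform T) and a local
# branch (motion feature f_t^{l,m} and dense motion matrix D).
import torch
import torch.nn as nn

class MotionModule(nn.Module):
    def __init__(self, in_channels=2048, mid_channels=256):
        super().__init__()
        self.fuse = nn.Conv2d(3 * in_channels, mid_channels, 3, padding=1)
        self.to_T = nn.Linear(mid_channels, 9)                      # global transformation matrix T (3x3)
        self.local_deconv = nn.Sequential(                          # 7x7 -> 224x224 for a 224x224 frame
            nn.ConvTranspose2d(mid_channels, 128, 4, stride=4), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=4), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 2, stride=2), nn.ReLU(),
            nn.Conv2d(32, 2, 3, padding=1),                         # dense local motion D (2 channels assumed)
        )

    def forward(self, f_prev, f_t, m0):
        corr = f_prev * f_t                                         # multiplicative comparison (correlation)
        x = torch.relu(self.fuse(torch.cat([corr, f_prev, f_t], dim=1)))
        f_gm = x.mean(dim=(2, 3))                                   # global motion feature f_t^{g,m}
        T = self.to_T(f_gm).view(-1, 3, 3)                          # fully connected layer -> T
        local = x * m0                                              # mask the features with M_t^(0)
        f_lm = local.mean(dim=(2, 3))                               # local motion feature f_t^{l,m}
        D = self.local_deconv(local)                                # deconvolutions -> frame-sized motion matrix
        return f_gm, T, f_lm, D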
S1.4, individual behavior characteristics:
For each pair of adjacent video frames I_{t-1}, I_t, the global static appearance feature f_t^{g,a} and motion feature f_t^{g,m} corresponding to the camera wearer and the local static appearance feature f_t^{l,a} and motion feature f_t^{l,m} corresponding to the interactor are obtained through the attention module and the motion module respectively. The behavior feature of the camera wearer is defined as f_t^{ego} = [f_t^{g,a}, f_t^{g,m}] and that of the interactor as f_t^{exo} = [f_t^{l,a}, f_t^{l,m}]; these two features will be used to model the interaction relationship between the camera wearer and the interactor.
S2, modeling a dual interaction relationship;
s2.1, a long-short-time memory module for interactive modeling;
the interaction behavior of the first person viewing angle involves interaction between the camera wearer and the interactor, so that the recognition effect is poor by only using the individual behavior features of the camera wearer and the interactor. In order to model the interaction relation between the two, the invention provides a long-time and short-time memory module for interaction modeling. The individual behavior characteristics of the camera wearer and the interactor are respectively input into corresponding Long Short-Term Memory modules (LSTM), the two modules are dual modules, and a symmetrical updating mode is adopted:
[i_t; o_t; g_t; a_t] = σ(W f_t + U F_{t-1} + J_{t-1} + b)
J_{t-1} = V F_{t-1}^*
c_t = i_t a_t + g_t c_{t-1}
F_t = o_t tanh(c_t)
where i_t, o_t, g_t, a_t are respectively the input gate, output gate, forget gate and input feature of the long-short-time memory module, σ is the sigmoid nonlinear function, φ is the linear rectification function (Rectified Linear Unit, ReLU), f_t is the individual behavior feature of the camera wearer or the interactor, i.e. f_t^{ego} or f_t^{exo} from section S1.4, F_t is the output feature of the corresponding long-short-time memory module, and F_t^* is the output feature of the dual module. That is, for the camera wearer, if the input feature of its long-short-time memory module is f_t = f_t^{ego} and its output feature is F_t, then F_t^* is the output feature of the long-short-time memory module corresponding to the interactor, and vice versa. W, U, V, b are the learnable parameters of the long-short-time memory module. For a conventional long-short-time memory module, the input at frame t comprises the feature of frame t and the output of the module at frame t-1; for the long-short-time memory module for interaction-relationship modeling, the input at frame t comprises not only the feature of frame t and the module's own output at frame t-1 but also the output of the dual module at frame t-1, which enables the module to model the interaction relationship between the camera wearer and the interactor. The model structure of the long-short-time memory module for interaction modeling is shown in Fig. 3.
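To make the update rule concrete, the sketch below implements one time step of the two dual cells. The coupling term J_{t-1} = V F_{t-1}^*, the input feature dimension and the hidden size are assumptions, not values given in the patent.

# Sketch of the dual interaction LSTM: each stream (camera wearer / interactor)
# updates from its own feature f_t, its previous output F_{t-1}, and the previous
# output of the dual stream F_{t-1}^*.
import torch
import torch.nn as nn

class InteractionLSTMCell(nn.Module):
    def __init__(self, in_dim, hid_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, 4 * hid_dim)                 # acts on f_t (includes bias b)
        self.U = nn.Linear(hid_dim, 4 * hid_dim, bias=False)    # acts on own output F_{t-1}
        self.V = nn.Linear(hid_dim, 4 * hid_dim, bias=False)    # acts on dual output F_{t-1}^*

    def forward(self, f_t, F_prev, c_prev, F_dual_prev):
        gates = torch.sigmoid(self.W(f_t) + self.U(F_prev) + self.V(F_dual_prev))
        i_t, o_t, g_t, a_t = gates.chunk(4, dim=-1)             # input, output, forget gates and input feature
        c_t = i_t * a_t + g_t * c_prev                          # intermediate (cell) feature
        F_t = o_t * torch.tanh(c_t)                             # output feature
        return F_t, c_t

# One symmetric (dual) update step for both streams.
hid = 128
ego_cell, exo_cell = InteractionLSTMCell(4096, hid), InteractionLSTMCell(4096, hid)
f_ego, f_exo = torch.randn(2, 4096), torch.randn(2, 4096)
F_ego = F_exo = torch.zeros(2, hid)
c_ego = c_exo = torch.zeros(2, hid)
F_ego_new, c_ego = ego_cell(f_ego, F_ego, c_ego, F_exo)   # wearer stream uses interactor output
F_exo_new, c_exo = exo_cell(f_exo, F_exo, c_exo, F_ego)   # interactor stream uses wearer output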
S2.2, identifying interaction behaviors;
the long-short-time memory module for interactive modeling explicitly models the interaction relationship between a camera wearer and an interactor by taking the output of the dual module from the last frame as the input of the current frame. Finally, in order to identify the interaction behavior, the outputs of the two long-short-time memory modules in the last frame N of the video are added point by point, and the fused characteristics are obtained through nonlinear operation:
R_N = φ(F_N + F_N^*)
A linear classifier is added on R_N, and the probability corresponding to each behavior category is obtained through a softmax function:
p(y|R_N) = softmax(W R_N + b)
optimizing the classification result using a cross entropy loss function:
L_cls = − Σ_{k=1}^{K} y_k log p(y_k | R_N)
where y_k is the label of category k, i.e. y_k = 1 if the behavior category index is k and y_k = 0 otherwise; K is the total number of categories.
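A sketch of this fusion and classification step is given below; ReLU is assumed as the nonlinearity φ, and the hidden size and number of behavior categories are placeholder values.

# Sketch of interaction-behavior classification: fuse the last-frame outputs of the
# two LSTM streams, apply a linear classifier, and train with cross-entropy.
import torch
import torch.nn as nn
import torch.nn.functional as F

class InteractionClassifier(nn.Module):
    def __init__(self, hid_dim=128, num_classes=8):
        super().__init__()
        self.fc = nn.Linear(hid_dim, num_classes)

    def forward(self, F_ego_N, F_exo_N):
        R_N = torch.relu(F_ego_N + F_exo_N)      # point-wise sum + nonlinearity (ReLU assumed)
        return self.fc(R_N)                      # linear classifier W R_N + b

clf = InteractionClassifier()
logits = clf(torch.randn(2, 128), torch.randn(2, 128))
labels = torch.tensor([3, 5])
L_cls = F.cross_entropy(logits, labels)          # cross-entropy over the softmax probabilities
print(L_cls.item())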
S2.3, overall learning of the model
The above defines the human body analysis loss L_seg used to assist the learning of the attention module, the reconstruction loss L_rec used for motion feature learning, and the loss L_cls used for interaction behavior classification. The overall loss function of the model is a weighted sum of these three loss functions:
L = L_cls + α L_seg + β L_rec
where α and β are the weights of L_seg and L_rec respectively; the proposed overall model performs end-to-end learning based on this loss function.
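For illustration, the weighted combination can be written as follows; the placeholder loss values and the weights α and β are assumptions, since the patent does not specify them.

# Standalone sketch of the overall objective L = L_cls + alpha*L_seg + beta*L_rec.
import torch

L_cls = torch.tensor(1.2, requires_grad=True)   # stands in for the classification loss
L_seg = torch.tensor(0.4, requires_grad=True)   # stands in for the human body segmentation loss
L_rec = torch.tensor(0.7, requires_grad=True)   # stands in for the reconstruction loss
alpha, beta = 1.0, 0.1                          # assumed weighting hyperparameters
L = L_cls + alpha * L_seg + beta * L_rec
L.backward()                                     # gradients flow to all three terms for end-to-end learning
print(L.item())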
Current first-person-view interaction behavior recognition based on deep learning mainly treats the interaction behavior between the camera wearer and the interactor as a whole and learns the appearance and motion features related to the interaction behavior. However, the interaction behavior involves the interplay between two people who have different appearance and motion information, and the appearance and motion information of each person together with the interaction relationship between them jointly determine the category of the interaction behavior. Describing the interaction behavior directly as a whole therefore cannot effectively express the interaction relationship. In the scheme provided by the invention, the attention module and the motion module are first used to learn the appearance and motion features corresponding to the camera wearer and the interactor respectively, and the interaction relationship between them is then explicitly modeled through the long-short-time memory module, giving a description of the interaction behavior. This modeling approach is more consistent with the characteristics of first-person-view interaction behavior, so the method provided by the invention can better describe first-person-view interaction behavior and thereby help interaction behavior recognition.
The above embodiment is a preferred embodiment of the present invention, but the embodiments of the present invention are not limited thereto; any other change, modification, substitution, combination or simplification made without departing from the spirit and principle of the present invention shall be an equivalent replacement and is included within the protection scope of the present invention.

Claims (5)

1. The first person view video interactive behavior identification method based on interactive modeling is characterized by comprising the following steps of:
s1, explicitly separating a camera wearer and an interactor, respectively learning behavior characteristics of the camera wearer and the interactor, and comprising the following steps:
s1.1, separating an interactor from the background through an attention module;
s1.2, respectively extracting and learning the behavior features of the camera wearer and the interactor, including static appearance features and dynamic motion features; the static appearance feature is the feature of the static visual content seen by the camera wearer, namely the global appearance feature of the video frame I_t corresponding to the camera wearer and the local appearance feature of the video frame I_t corresponding to the interactor;
s1.3, motion feature learning: for the camera wearer, its motion information is the camera motion information, and the influence of this motion information on video frame changes is global; for the interactor, the influence of its motion information on video frame changes is local, and is represented by a dense motion matrix D ∈ R^{H×W}, which is multiplied point by point with the mask M_t^{(3)} generated by the attention module so that the motion matrix D acts only on the interactor and not on the background;
s1.4, for each pair of adjacent video frames I_{t-1}, I_t, the global static appearance feature f_t^{g,a} and motion feature f_t^{g,m} corresponding to the camera wearer and the local static appearance feature f_t^{l,a} and motion feature f_t^{l,m} corresponding to the interactor are obtained through the attention module and the motion module respectively; the behavior feature of the camera wearer is defined as f_t^{ego} = [f_t^{g,a}, f_t^{g,m}] and that of the interactor as f_t^{exo} = [f_t^{l,a}, f_t^{l,m}], and these two features model the interaction relationship between the camera wearer and the interactor;
s2, modeling a dual interaction relationship;
s2.1, constructing a long-short-time memory module for interactive modeling; in step S2.1, the long-short-time memory module for interactive modeling is specifically constructed as follows:
the individual behavior characteristics of the camera wearer and the interactor are respectively input into corresponding long-short-time memory modules, the two modules are dual modules, and a symmetrical updating mode is adopted:
[i_t; o_t; g_t; a_t] = σ(W f_t + U F_{t-1} + J_{t-1} + b)
J_{t-1} = V F_{t-1}^*
c_t = i_t a_t + g_t c_{t-1}
F_t = o_t tanh(c_t)
where i_t, o_t, g_t, a_t are respectively the input gate, output gate, forget gate and input feature of the long-short-time memory module, σ is the sigmoid nonlinear function, φ is the linear rectification function, f_t is the individual behavior feature of the camera wearer or the interactor, c_t is the intermediate feature of the long-short-time memory module, F_t is the output feature of the corresponding long-short-time memory module, F_t^* is the output feature of the dual module, and V, b are learnable parameters of the long-short-time memory module;
s2.2, the long-short-time memory module for interactive modeling explicitly models the interaction relationship between the camera wearer and the interactor by taking the output of the dual module at the previous frame as an input of the current frame; in step S2.2, the method further comprises the following steps:
the outputs of the two long-short-time memory modules at the last frame N of the video are added point by point, and the fused feature is obtained through a nonlinear operation:
R_N = φ(F_N + F_N^*)
a linear classifier is added on R_N, and the probability corresponding to each behavior category is obtained through a softmax function:
p(y|R_N) = softmax(W R_N + b)
optimizing the classification result using a cross entropy loss function:
L_cls = − Σ_{k=1}^{K} y_k log p(y_k | R_N)
where y_k is the label of category k, i.e. y_k = 1 if the behavior category index is k and y_k = 0 otherwise; K is the total number of categories.
2. The method for identifying interaction behavior of a video of a first person viewing angle based on interaction modeling according to claim 1, wherein in step S1.1, the attention module is specifically:
for two adjacent frames I_{t-1}, I_t ∈ R^{H×W×3} of the video, where t is the frame index, H and W are respectively the height and width of the video frame, and 3 is the number of channels of the video frame, representing the three RGB channels, a deep convolutional neural network is used to extract features from each frame, obtaining the visual features f_{t-1}, f_t ∈ R^{H0×W0×C} corresponding to the two frames, where H0 and W0 are respectively the height and width of the feature map and C is the number of feature channels; these features extract the static appearance information in the video; the attention module adds a series of deconvolution layers on the visual feature f_t to obtain a group of masks of different sizes M_t^{(0)}, M_t^{(1)}, M_t^{(2)}, M_t^{(3)}, where the size of M_t^{(0)} is H0×W0, i.e. equal to that of the feature map f_t, and the size of M_t^{(3)} is H×W, i.e. equal to that of the video frame I_t; M_t^{(0)} is used to separate the appearance features of the interactor from the feature f_t, and M_t^{(3)} is used in the subsequent motion module.
3. The method for identifying interaction behavior of a video of a first person viewing angle based on interaction modeling according to claim 2, wherein in step S1.2, the static appearance features are extracted by:
based on the mask M_t^{(0)} generated by the attention module, the static appearance features of the camera wearer and the interactor are extracted respectively; the camera wearer is not visible in the first-person video, so its static appearance feature is defined as the feature of the static visual content seen by the camera wearer, namely the global appearance feature of the video frame I_t:
f_t^{g,a} = (1 / (H0 · W0)) Σ_{i,j} f_t(i, j)
where i, j are the two-dimensional coordinate indices on the feature map f_t, and the global appearance feature f_t^{g,a} is a vector of dimension C;
for the interactor, the mask M_t^{(0)} generated by the attention module gives the spatial position information of the interactor, and the static appearance feature of the interactor can be obtained from the appearance feature f_t of the video frame and the mask M_t^{(0)} as:
f_t^{l,a} = Σ_{i,j} M_t^{(0)}(i, j) f_t(i, j) / Σ_{i,j} M_t^{(0)}(i, j)
4. the method for identifying interaction behavior of a video of a first person viewing angle based on interaction modeling according to claim 1, wherein in step S1.3, the motion characteristics are:
for the camera wearer, a global transformation matrix T ∈ R^{3×3} is used to describe the global motion information; each pixel position (x, y) in the video frame is extended to a three-dimensional vector p = (x, y, 1)^T and then transformed as T·p with the global transformation matrix T, the transformation result being equivalent to rotating, scaling and translating the video frame;
for the interactor, the influence of its motion information on video frame changes is local and is represented by a dense motion matrix D ∈ R^{H×W}, which is multiplied point by point with the mask M_t^{(3)} generated by the attention module so that the motion matrix D acts only on the interactor and not on the background; given the global transformation matrix T, the local motion matrix D and the coordinate matrix X_{t-1} of frame t-1, the coordinate matrix X_t of frame t is predicted as:
X_t = T · X_{t-1} + M_t^{(3)} ⊙ D
using the predicted coordinate matrix X_t of frame t, the video frame Î_t of frame t is reconstructed by interpolation, and T and D are learned through the reconstruction loss between the reconstructed video frame and the real video frame:
L_rec = Σ_x ‖ Î_t(x) − I_t(x) ‖
where x is the index of each two-dimensional spatial coordinate.
5. The method for identifying interaction behavior of a video of a first person viewing angle based on interaction modeling according to claim 4, wherein the global transformation matrix T and the local motion matrix D are generated by a motion module; the motion module takes the visual features f_{t-1}, f_t of frame t-1 and frame t as input; f_{t-1} and f_t are compared multiplicatively to compute their correlation, and the result is concatenated with f_{t-1} and f_t along the channel dimension to obtain a new feature map; a convolution layer is added on this feature map, and the result is fed into two branches: one branch is globally pooled to obtain the global motion feature f_t^{g,m} corresponding to the camera wearer, and a fully connected layer is added on this global feature to obtain the global transformation matrix T; the other branch is multiplied point by point with the mask M_t^{(0)} generated by the attention module to obtain a feature, to which three deconvolution layers and one convolution layer are added to obtain the local motion matrix D matching the size of the video frame, and the feature before the deconvolution layers is globally pooled to obtain the local motion feature f_t^{l,m} corresponding to the interactor.
CN202010009544.XA 2020-01-06 2020-01-06 First person view video interactive behavior identification method based on interactive modeling Active CN111241963B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010009544.XA CN111241963B (en) 2020-01-06 2020-01-06 First person view video interactive behavior identification method based on interactive modeling

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010009544.XA CN111241963B (en) 2020-01-06 2020-01-06 First person view video interactive behavior identification method based on interactive modeling

Publications (2)

Publication Number Publication Date
CN111241963A CN111241963A (en) 2020-06-05
CN111241963B true CN111241963B (en) 2023-07-14

Family

ID=70874282

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010009544.XA Active CN111241963B (en) 2020-01-06 2020-01-06 First person view video interactive behavior identification method based on interactive modeling

Country Status (1)

Country Link
CN (1) CN111241963B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112464875A (en) * 2020-12-09 2021-03-09 南京大学 Method and device for detecting human-object interaction relationship in video
CN112686194B (en) * 2021-01-06 2023-07-18 中山大学 First person visual angle action recognition method, system and storage medium
CN113569756B (en) * 2021-07-29 2023-06-09 西安交通大学 Abnormal behavior detection and positioning method, system, terminal equipment and readable storage medium
CN114581487B (en) * 2021-08-02 2022-11-25 北京易航远智科技有限公司 Pedestrian trajectory prediction method, device, electronic equipment and computer program product
CN115082840B (en) * 2022-08-16 2022-11-15 之江实验室 Action video classification method and device based on data combination and channel correlation

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106227341A (en) * 2016-07-20 2016-12-14 南京邮电大学 Unmanned plane gesture interaction method based on degree of depth study and system
CN106845351A (en) * 2016-05-13 2017-06-13 苏州大学 It is a kind of for Activity recognition method of the video based on two-way length mnemon in short-term
CN107515674A (en) * 2017-08-08 2017-12-26 山东科技大学 It is a kind of that implementation method is interacted based on virtual reality more with the mining processes of augmented reality
CN109241834A (en) * 2018-07-27 2019-01-18 中山大学 A kind of group behavior recognition methods of the insertion based on hidden variable

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106845351A (en) * 2016-05-13 2017-06-13 苏州大学 It is a kind of for Activity recognition method of the video based on two-way length mnemon in short-term
CN106227341A (en) * 2016-07-20 2016-12-14 南京邮电大学 Unmanned plane gesture interaction method based on degree of depth study and system
CN107515674A (en) * 2017-08-08 2017-12-26 山东科技大学 It is a kind of that implementation method is interacted based on virtual reality more with the mining processes of augmented reality
CN109241834A (en) * 2018-07-27 2019-01-18 中山大学 A kind of group behavior recognition methods of the insertion based on hidden variable

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Interactive volumetric three-dimensional display based on dynamic gesture control; Pan Wenping et al.; Opto-Electronic Engineering; 2010-12-31 (No. 12); pp. 88-95 *

Also Published As

Publication number Publication date
CN111241963A (en) 2020-06-05

Similar Documents

Publication Publication Date Title
CN111241963B (en) First person view video interactive behavior identification method based on interactive modeling
CN111339903B (en) Multi-person human body posture estimation method
CN107341452B (en) Human behavior identification method based on quaternion space-time convolution neural network
CN112926396B (en) Action identification method based on double-current convolution attention
CN109829427B (en) Face clustering method based on purity detection and spatial attention network
CN109948475B (en) Human body action recognition method based on skeleton features and deep learning
Zhang et al. Deep hierarchical guidance and regularization learning for end-to-end depth estimation
CN112800903B (en) Dynamic expression recognition method and system based on space-time diagram convolutional neural network
CN112307995B (en) Semi-supervised pedestrian re-identification method based on feature decoupling learning
Deng et al. MVF-Net: A multi-view fusion network for event-based object classification
CN111814719A (en) Skeleton behavior identification method based on 3D space-time diagram convolution
CN111368846B (en) Road ponding identification method based on boundary semantic segmentation
CN113344806A (en) Image defogging method and system based on global feature fusion attention network
JP6842745B2 (en) Image discrimination device and image discrimination method
CN112651940B (en) Collaborative visual saliency detection method based on dual-encoder generation type countermeasure network
CN112991350A (en) RGB-T image semantic segmentation method based on modal difference reduction
CN113128424A (en) Attention mechanism-based graph convolution neural network action identification method
Grigorev et al. Depth estimation from single monocular images using deep hybrid network
CN112581409A (en) Image defogging method based on end-to-end multiple information distillation network
Wang et al. AAGAN: enhanced single image dehazing with attention-to-attention generative adversarial network
Yin et al. Graph-based normalizing flow for human motion generation and reconstruction
CN112183464A (en) Video pedestrian identification method based on deep neural network and graph convolution network
CN114724251A (en) Old people behavior identification method based on skeleton sequence under infrared video
CN116797787B (en) Remote sensing image semantic segmentation method based on cross-modal fusion and graph neural network
CN111027433A (en) Multiple style face characteristic point detection method based on convolutional neural network

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant