CN111241963B - First person view video interactive behavior identification method based on interactive modeling - Google Patents

First person view video interactive behavior identification method based on interactive modeling

Info

Publication number
CN111241963B
Authority
CN
China
Prior art keywords
motion
module
interactors
video
interactor
Prior art date
Legal status
Active
Application number
CN202010009544.XA
Other languages
Chinese (zh)
Other versions
CN111241963A (en)
Inventor
郑伟诗
蔡祎俊
李昊昕
陈立
Current Assignee
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN202010009544.XA priority Critical patent/CN111241963B/en
Publication of CN111241963A publication Critical patent/CN111241963A/en
Application granted granted Critical
Publication of CN111241963B publication Critical patent/CN111241963B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Abstract

The invention discloses a first-person-view video interaction behavior recognition method based on interaction modeling, which separates the camera wearer from the interactor, learns the corresponding static appearance and dynamic motion features of each, and then explicitly models the interaction relationship between the two. To separate the interactor from the background, a mask is generated with an attention model, and a human body analysis model is used to assist the learning of the attention model; a motion module is provided to predict the motion information matrices corresponding to the camera wearer and the interactor respectively, and the learning of the motion module is assisted by reconstructing the next frame. Finally, a dual long-short-time memory module for interaction modeling is provided, on the basis of which the interaction relationship is explicitly modeled. The invention can well describe and recognize first-person-view interaction behavior and obtains leading recognition results on common first-person-view interaction behavior research datasets.

Description

First person view video interactive behavior identification method based on interactive modeling
Technical Field
The invention belongs to the technical field of behavior recognition, and particularly relates to a first person view video interactive behavior recognition method based on interactive modeling.
Background
At present, mainstream first-person interaction behavior recognition methods fall into two categories. One uses hand-crafted motion features such as motion trajectories and optical flow, combined with traditional classifiers such as support vector machines; the other performs feature learning through deep learning, adopting models similar to those used for third-person video behavior recognition, and uses convolutional neural networks and long-short-time memory models to learn behavior features directly from video frames.
The main shortcoming of the above prior art is the lack of explicit modeling of the interaction relationship between the camera wearer and the interactor. The prior art generally learns the overall features of the interaction behavior directly; however, first-person-view interaction behavior arises from the interplay between the camera wearer and the interactor, and such behavior can be better described by explicitly modeling this interaction relationship. Because the prior art lacks explicit interaction modeling, it cannot describe the interaction behavior well.
Disclosure of Invention
The invention aims to overcome the defects and shortcomings of the prior art and provides a first-person visual angle video interactive behavior recognition method based on interactive modeling.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
the first person view video interactive behavior identification method based on interactive modeling comprises the following steps:
s1, explicitly separating a camera wearer and an interactor, respectively learning behavior characteristics of the camera wearer and the interactor, and comprising the following steps:
s1.1, separating an interactor from the background through an attention module;
s1.2, respectively extracting and learning the behavior features of the camera wearer and the interactor, including static appearance features and dynamic motion features; the static appearance feature is the feature of the static visual content seen by the camera wearer, namely the global appearance feature of the video frame I_t corresponding to the camera wearer and the local appearance feature of the video frame I_t corresponding to the interactor;
s1.3, motion feature learning: for the camera wearer, its motion information is the camera motion information, and the influence of this motion information on video frame changes is global; for the interactor, the influence of its motion information on video frame changes is local, and is represented by a dense motion matrix D ∈ R^{H×W}, which is multiplied point by point with the mask M_t^{(3)} generated by the attention module so that the motion matrix D acts only on the interactor and not on the background;
s1.4, for each pair of adjacent video frames I_{t-1}, I_t, the global static appearance feature f_t^{g,a} and motion feature f_t^{g,m} corresponding to the camera wearer and the local static appearance feature f_t^{l,a} and motion feature f_t^{l,m} corresponding to the interactor are obtained through the attention module and the motion module respectively. The behavior feature of the camera wearer is defined as f_t^{ego} = [f_t^{g,a}, f_t^{g,m}] and that of the interactor as f_t^{exo} = [f_t^{l,a}, f_t^{l,m}]; these two features are used to model the interaction relationship between the camera wearer and the interactor.
S2, modeling a dual interaction relationship;
s2.1, constructing a long-short-time memory module for interactive modeling;
s2.2, the long-short-time memory module for interactive modeling explicitly models the interaction relationship between the camera wearer and the interactor by taking the output of the dual module at the previous frame as an input of the current frame.
As a preferred technical solution, in step S1.1, the attention module specifically includes:
for two adjacent frames I_{t-1}, I_t ∈ R^{H×W×3} of the video, where t is the frame index, H and W are respectively the height and width of the video frame, and 3 is the number of channels of the video frame, representing the three RGB channels, a deep convolutional neural network is used to extract features from each frame, obtaining the visual features f_{t-1}, f_t ∈ R^{H0×W0×C} corresponding to the two frames, where H0 and W0 are respectively the height and width of the feature map and C is the number of feature channels; these features extract the static appearance information in the video. The attention module adds a series of deconvolution layers on the visual feature f_t to obtain a group of masks of different sizes M_t^{(0)}, M_t^{(1)}, M_t^{(2)}, M_t^{(3)}, where the size of M_t^{(0)} is H0×W0, i.e. equal to that of the feature map f_t, and the size of M_t^{(3)} is H×W, i.e. equal to that of the video frame I_t; M_t^{(0)} is used to separate the appearance features of the interactor from the feature f_t, and M_t^{(3)} is used in the subsequent motion module.
As a preferred technical scheme, the method further comprises the following steps:
the human body analysis model and a corresponding human body segmentation loss are introduced to assist the learning of the attention module, specifically comprising the following steps:
the existing human body analysis model JPPNet is used to generate a corresponding reference mask M_t^{RF} for the video frame I_t, and the human body segmentation loss corresponding to the attention model is:
[Formula for the human body segmentation loss L_seg, comparing the generated masks with the reference mask M_t^{RF}; presented as an image in the original document.]
as a preferred technical solution, in step S1.2, the static appearance feature is extracted by:
based on the mask M_t^{(0)} generated by the attention module, the static appearance features of the camera wearer and the interactor are extracted respectively; the camera wearer is not visible in the first-person video, so its static appearance feature is defined as the feature of the static visual content seen by the camera wearer, namely the global appearance feature of the video frame I_t:
f_t^{g,a} = (1 / (H0 · W0)) Σ_{i,j} f_t(i, j)
where i, j are the two-dimensional coordinate indices on the feature map f_t, and the global appearance feature f_t^{g,a} is a vector of dimension C;
for the interactor, the mask M_t^{(0)} generated by the attention module gives the spatial position information of the interactor, and the static appearance feature of the interactor is obtained from the appearance feature f_t of the video frame and the mask M_t^{(0)} as:
f_t^{l,a} = Σ_{i,j} M_t^{(0)}(i, j) f_t(i, j) / Σ_{i,j} M_t^{(0)}(i, j)
as a preferable technical solution, in step S1.3, the dynamic characteristics are:
for the camera wearer, a global transformation matrix T ∈ R^{3×3} is used to describe the global motion information; each pixel position (x, y) in the video frame is extended to a three-dimensional vector p = (x, y, 1)^T and then transformed as T·p with the global transformation matrix T, the transformation result being equivalent to rotating, scaling and translating the video frame;
for the interactor, the influence of its motion information on video frame changes is local and is represented by a dense motion matrix D ∈ R^{H×W}, which is multiplied point by point with the mask M_t^{(3)} generated by the attention module so that the motion matrix D acts only on the interactor and not on the background; given the global transformation matrix T, the local motion matrix D and the coordinate matrix X_{t-1} of frame t-1, the coordinate matrix X_t of frame t is predicted as:
X_t = T · X_{t-1} + M_t^{(3)} ⊙ D
using the predicted coordinate matrix X_t of frame t, the video frame Î_t of frame t is reconstructed by interpolation, and T and D are learned through the reconstruction loss between the reconstructed video frame and the real video frame:
L_rec = Σ_x ‖ Î_t(x) − I_t(x) ‖
where x is the index of each two-dimensional spatial coordinate.
As a preferable technical scheme, the global transformation matrix T and the local motion matrix D are generated by a motion module. The motion module takes the visual features f_{t-1}, f_t of frame t-1 and frame t as input; f_{t-1} and f_t are compared multiplicatively to compute their correlation, and the result is concatenated with f_{t-1} and f_t along the channel dimension to obtain a new feature map. A convolution layer is added on this feature map, and the result is fed into two branches: one branch is globally pooled to obtain the global motion feature f_t^{g,m} corresponding to the camera wearer, and a fully connected layer is added on this global feature to obtain the global transformation matrix T; the other branch is multiplied point by point with the mask M_t^{(0)} generated by the attention module to obtain a feature, to which three deconvolution layers and one convolution layer are added to obtain the local motion matrix D matching the size of the video frame, and the feature before the deconvolution layers is globally pooled to obtain the local motion feature f_t^{l,m} corresponding to the interactor.
As an preferable technical solution, in step S2.1, the long-short-time memory module for interactive modeling is specifically constructed as follows:
the individual behavior characteristics of the camera wearer and the interactor are respectively input into corresponding long-short-time memory modules, the two modules are dual modules, and a symmetrical updating mode is adopted:
[i_t; o_t; g_t; a_t] = σ(W f_t + U F_{t-1} + J_{t-1} + b)
J_{t-1} = V F_{t-1}^*
c_t = i_t a_t + g_t c_{t-1}
F_t = o_t tanh(c_t)
where i_t, o_t, g_t, a_t are respectively the input gate, output gate, forget gate and input feature of the long-short-time memory module, σ is the sigmoid nonlinear function, φ is the linear rectification function, f_t is the individual behavior feature of the camera wearer or the interactor, c_t is the intermediate feature of the long-short-time memory module, F_t is the output feature of the corresponding long-short-time memory module, and F_t^* is the output feature of the dual module.
As a preferred technical solution, in step S2.2, the method further includes the following steps:
the outputs of the two long-short-time memory modules at the last frame N of the video are added point by point, and the fused feature is obtained through a nonlinear operation:
R_N = φ(F_N + F_N^*)
a linear classifier is added on R_N, and the probability corresponding to each behavior category is obtained through a softmax function:
p(y|R_N) = softmax(W R_N + b)
optimizing the classification result using a cross entropy loss function:
L_cls = − Σ_{k=1}^{K} y_k log p(y_k | R_N)
where y_k is the label of category k, i.e. y_k = 1 if the behavior category index is k and y_k = 0 otherwise; K is the total number of categories.
Compared with the prior art, the invention has the following advantages and beneficial effects:
the invention provides a camera wearer and an interactor which are explicitly separated and respectively learn the characteristics of the camera wearer and the interactor, and the interactor is separated from the background through an attention module, so that static appearance characteristics respectively corresponding to the camera wearer and the interactor are obtained, and the movement characteristics respectively corresponding to the camera wearer and the interactor are learned through a movement module. Based on the behavior characteristics of the two, the invention further provides a long-short-time memory module for modeling the explicit interaction relationship, so that the interaction behavior between the camera wearer and the interactors is described and identified. By the technical scheme, different interaction behaviors under the first person view angle can be well identified and classified, and better performance on each research data set is obtained. Therefore, the first person visual angle interaction behavior recognition model based on explicit interaction modeling is an effective interaction behavior recognition model, and can be deployed in an intelligent wearable device system, so that the intelligent system can automatically recognize and process different interaction behaviors.
Drawings
Fig. 1 is a schematic view of the structure of the attention module of the present invention.
Fig. 2 is a schematic diagram of the results of the motion module of the present invention.
Fig. 3 is a schematic diagram of the long-short-time memory module for interactive modeling according to the present invention.
Fig. 4 is a flow chart of the first person view video interactive behavior recognition method based on interactive modeling.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but embodiments of the present invention are not limited thereto.
Examples
The invention addresses the problem that, given a video clip, an intelligent video analysis system needs to identify the category of the behavior of the people in the video. In an intelligent video analysis system based on wearable devices, the camera is worn on a person, the video is captured from the first-person view, and the category of the interaction behavior between the wearer and another person under the first-person view needs to be recognized. Current mainstream first-person-view interaction behavior recognition methods mostly adopt approaches similar to behavior recognition under the third-person view and learn features directly from the static appearance and dynamic motion information of the whole video, without explicitly separating the camera wearer from the interactor who generates the interaction behavior with the wearer, and without modeling the relationship between them. Aiming at the problem of recognizing first-person-view interaction behavior under a single view, the invention proposes to explicitly separate the camera wearer and the interactor, learn their features respectively, and model the relationship between them with a dual relationship model. To explicitly separate the camera wearer from the interactor, the invention provides an attention module on the features of a deep convolutional neural network and uses a human body analysis algorithm to assist the learning of the attention module, so that the interactor and the surrounding information can be separated from the background. To learn the motion features of the camera wearer and the interactor respectively, the invention further provides a motion information module on the basis of the attention module, which learns a global motion matrix for the camera wearer and a dense local motion matrix for the interactor; the two matrices are applied to a frame in the video to attempt to reconstruct the corresponding next frame, and the two motion matrices and the corresponding motion features are learned through the reconstruction error. The invention further combines the learned motion features with the static appearance features based on the convolutional neural network, and provides a dual long-short-time memory model to model the relationship between the camera wearer and the interactor. By explicitly separating the camera wearer and the interactor, learning their features respectively and modeling the interaction relationship between them, the proposed model can well describe and recognize interaction behavior under the first-person view, and obtains the current best results on common first-person-view interaction behavior recognition datasets.
As shown in fig. 4, the first person view video interaction behavior recognition method based on interaction modeling in this embodiment specifically includes the following steps:
s1, individual behavior feature expression:
to explicitly model the interaction relationship between the camera wearer and the interactors, we first need to separate the two and learn the behavior characteristics of the two separately. The method mainly comprises two steps, wherein the first step is to separate an interactor from the background through an attention module, and the second step is to respectively extract and learn the behavior characteristics of a camera wearer and the interactor, including static appearance characteristics and dynamic movement characteristics.
S1.1, an attention module;
the function of the attention module is to separate the interactors from the background in the first person perspective video. For adjacent two of the videos I t-1 ,I t ∈R H x W x 3 Wherein t is the frame number, H and W are the height and width of the video frame respectively, 3 is the channel number of the video frame and represents RGB three channels, we extract the features respectively by a depth convolution neural network ResNet-50 to obtain the visual features f corresponding to the two frames t-1 ,f t ∈R H0 x W0 x C Wherein H0, W0 are the height and width of the feature map, respectively, and C is the number of channels characterized. These features extract static appearance information in the video. The attention module provided by the invention is characterized in visual characteristic f t Adding a series of deconvolution layers to obtain a group of masks M with different sizes t (0) ,M t (1) ,M t (2) ,M t (3) Wherein M is t (0) The size of (a) is H0x W0, i.e. the sum of the features f t Equal in size, and M t (3) Is H x W, i.e. is equal to video frame I t Is the same size. M is M t (0) For slave feature f t Separating the appearance characteristics of the interactors, and M t (3) For use in a subsequent motion profile module.
So that the masks M_t^{(k)} (k = 0, 1, 2, 3) generated by the attention module can separate the interactor from the background, a human body analysis model and a corresponding human body segmentation loss are introduced to assist the learning of the attention module. Specifically, the existing human body analysis model JPPNet is used to generate a corresponding reference mask M_t^{RF} for the video frame I_t, and the human body segmentation loss corresponding to the attention model is:
[Formula for the human body segmentation loss L_seg, comparing the generated masks with the reference mask M_t^{RF}; presented as an image in the original document.]
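As an illustration of how such an attention module could be realized, the following PyTorch sketch stacks deconvolution layers on the frame feature f_t and predicts masks at four scales; the backbone channel count, the 7×7 feature map and the 224×224 frame size are assumptions made for illustration and are not specified in the patent.

# Minimal sketch (not the patented implementation) of an attention module that
# produces masks M_t^(0)..M_t^(3) from the frame feature f_t via deconvolutions.
import torch
import torch.nn as nn

class AttentionModule(nn.Module):
    def __init__(self, in_channels=2048):
        super().__init__()
        # M_t^(0): same spatial size as the feature map f_t
        self.mask0 = nn.Sequential(nn.Conv2d(in_channels, 1, 1), nn.Sigmoid())
        # deconvolution pyramid: 7x7 -> 28x28 -> 112x112 -> 224x224
        self.up = nn.ModuleList([
            nn.ConvTranspose2d(in_channels, 256, 4, stride=4),
            nn.ConvTranspose2d(256, 64, 4, stride=4),
            nn.ConvTranspose2d(64, 16, 2, stride=2),
        ])
        # one mask head per scale for M_t^(1), M_t^(2), M_t^(3)
        self.mask_heads = nn.ModuleList([
            nn.Sequential(nn.Conv2d(c, 1, 1), nn.Sigmoid()) for c in (256, 64, 16)
        ])

    def forward(self, f_t):
        masks = [self.mask0(f_t)]
        x = f_t
        for deconv, head in zip(self.up, self.mask_heads):
            x = torch.relu(deconv(x))
            masks.append(head(x))
        return masks  # [M_t^(0), M_t^(1), M_t^(2), M_t^(3)], M_t^(3) is frame-sized

masks = AttentionModule()(torch.randn(1, 2048, 7, 7))
print([m.shape[-1] for m in masks])  # [7, 28, 112, 224]

In such a sketch, the human body segmentation loss could compare each generated mask with the JPPNet reference mask resized to the matching scale; the exact form of that comparison is not given in the text.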
s1.2, extracting static appearance characteristics;
Based on the mask M_t^{(0)} generated by the attention module, the static appearance features of the camera wearer and the interactor can be extracted respectively. The camera wearer is not visible in the first-person video, so its static appearance feature is defined as the feature of the static visual content seen by the camera wearer, namely the global appearance feature of the video frame I_t:
f_t^{g,a} = (1 / (H0 · W0)) Σ_{i,j} f_t(i, j)
where i, j are the two-dimensional coordinate indices on the feature map f_t, and the global appearance feature f_t^{g,a} is a vector of dimension C.
For the interactor, the mask M_t^{(0)} generated by the attention module gives the spatial position information of the interactor, and the static appearance feature of the interactor can be obtained from the appearance feature f_t of the video frame and the mask M_t^{(0)} as:
f_t^{l,a} = Σ_{i,j} M_t^{(0)}(i, j) f_t(i, j) / Σ_{i,j} M_t^{(0)}(i, j)
this feature describes the local static appearance information of the interactors.
S1.3, learning motion characteristics:
to describe the behavior of a video, it is not enough to rely on static appearance information alone, but it is also necessary to extract dynamic motion information from the video and learn the corresponding motion characteristics for the camera wearer and the interactors, respectively. For the camera wearer, the motion information is the camera motion information, and the influence of the motion information on the video frame change is global. Using oneEach global transformation matrix T epsilon R 3 x 3 This global motion information is described. For each pixel position (x, y) in a video frame, it is extended to a three-dimensional vector p= (x, y, 1) T Then, transform t×p is performed with the global transform matrix T, and the transform result is equivalent to rotating, scaling, and translating the video frame. Since each pixel in the video frame is transformed identically, the motion information corresponding to this transform T is global.
For interactors, the influence of motion information on video frame transformation is local and passes through a dense motion matrix D E R H x W To represent the movement information of the interactors and through a mask M generated by the attention module t (3) The multiplication is gradual, such that the motion matrix D acts only on the interactors and not on the background. Given a global transformation matrix T and a local motion matrix D, and a coordinate matrix X of the T-1 frame t-1 Coordinate matrix X of t frame can be predicted t
X_t = T · X_{t-1} + M_t^{(3)} ⊙ D
Using the predicted coordinate matrix X_t of frame t, the video frame Î_t of frame t can be reconstructed by interpolation, and T and D are learned through the reconstruction loss between the reconstructed video frame and the real video frame:
L_rec = Σ_x ‖ Î_t(x) − I_t(x) ‖
where x is the index of each two-dimensional spatial coordinate.
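The following sketch shows one possible realization of this warping and reconstruction step. It assumes that D is treated as a two-channel (x, y) displacement field, that bilinear sampling implements the interpolation, and that an L1 reconstruction loss is used; none of these details are fixed by the text.

# Sketch of motion-based frame reconstruction: warp the pixel-coordinate grid of
# frame t-1 with the global transform T plus the masked local motion, sample
# frame t-1 at the predicted coordinates, and compare with the real frame t.
import torch
import torch.nn.functional as F

def reconstruct_frame(frame_prev, T, D, m3):
    """frame_prev: (B,3,H,W); T: (B,3,3); D: (B,2,H,W); m3: (B,1,H,W) mask M_t^(3)."""
    B, _, H, W = frame_prev.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij")
    grid = torch.stack([xs, ys, torch.ones_like(xs)], dim=0)       # homogeneous pixel coordinates
    grid = grid.view(1, 3, -1).expand(B, -1, -1).to(frame_prev)
    warped = torch.bmm(T, grid)                                    # global transform applied to every pixel
    warped = warped[:, :2] / warped[:, 2:3].clamp(min=1e-6)        # back to 2-D coordinates
    warped = warped.reshape(B, 2, H, W) + m3 * D                   # add masked local motion of the interactor
    return F.grid_sample(frame_prev, warped.permute(0, 2, 3, 1),   # bilinear interpolation
                         align_corners=True)

def reconstruction_loss(recon, frame_t):
    return (recon - frame_t).abs().sum()                           # L1 reconstruction loss (assumed form)

# Sanity check: identity transform and zero motion reproduce frame t-1.
frame = torch.rand(1, 3, 64, 64)
loss = reconstruction_loss(
    reconstruct_frame(frame, torch.eye(3).unsqueeze(0), torch.zeros(1, 2, 64, 64),
                      torch.ones(1, 1, 64, 64)), frame)
print(round(loss.item(), 4))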
The global transformation matrix T and the local motion matrix D are generated by the motion module. The motion module takes the visual features f_{t-1}, f_t of frame t-1 and frame t as input; f_{t-1} and f_t are compared multiplicatively to compute their correlation, and the result is concatenated with f_{t-1} and f_t along the channel dimension to obtain a new feature map. A convolution layer is added on this feature map, and the result is fed into two branches: one branch is globally pooled to obtain the global motion feature f_t^{g,m} corresponding to the camera wearer, and a fully connected layer is added on this global feature to obtain the global transformation matrix T; the other branch is multiplied point by point with the mask M_t^{(0)} generated by the attention module to obtain a feature, to which three deconvolution layers and one convolution layer are added to obtain the local motion matrix D matching the size of the video frame, and the feature before the deconvolution layers is globally pooled to obtain the local motion feature f_t^{l,m} corresponding to the interactor. The parameters of the motion module are learned through the reconstruction loss described above. Fig. 2 shows a schematic diagram of the result of the motion module.
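A compact sketch of such a motion module follows; the channel widths, the deconvolution schedule and the two-channel form of D are illustrative assumptions rather than values taken from the patent.

# Sketch of the motion module: correlate f_{t-1} and f_t, concatenate, then split
# into a global branch (motion feature f_t^{g,m} and transform T) and a local
# branch (motion feature f_t^{l,m} and dense motion matrix D).
import torch
import torch.nn as nn

class MotionModule(nn.Module):
    def __init__(self, in_channels=2048, mid_channels=256):
        super().__init__()
        self.fuse = nn.Conv2d(3 * in_channels, mid_channels, 3, padding=1)
        self.to_T = nn.Linear(mid_channels, 9)                      # global transformation matrix T (3x3)
        self.local_deconv = nn.Sequential(                          # 7x7 -> 224x224 for a 224x224 frame
            nn.ConvTranspose2d(mid_channels, 128, 4, stride=4), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=4), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 2, stride=2), nn.ReLU(),
            nn.Conv2d(32, 2, 3, padding=1),                         # dense local motion D (2 channels assumed)
        )

    def forward(self, f_prev, f_t, m0):
        corr = f_prev * f_t                                         # multiplicative comparison (correlation)
        x = torch.relu(self.fuse(torch.cat([corr, f_prev, f_t], dim=1)))
        f_gm = x.mean(dim=(2, 3))                                   # global motion feature f_t^{g,m}
        T = self.to_T(f_gm).view(-1, 3, 3)                          # fully connected layer -> T
        local = x * m0                                              # mask the features with M_t^(0)
        f_lm = local.mean(dim=(2, 3))                               # local motion feature f_t^{l,m}
        D = self.local_deconv(local)                                # deconvolutions -> frame-sized motion matrix
        return f_gm, T, f_lm, D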
S1.4, individual behavior characteristics:
For each pair of adjacent video frames I_{t-1}, I_t, the global static appearance feature f_t^{g,a} and motion feature f_t^{g,m} corresponding to the camera wearer and the local static appearance feature f_t^{l,a} and motion feature f_t^{l,m} corresponding to the interactor are obtained through the attention module and the motion module respectively. The behavior feature of the camera wearer is defined as f_t^{ego} = [f_t^{g,a}, f_t^{g,m}] and that of the interactor as f_t^{exo} = [f_t^{l,a}, f_t^{l,m}]; these two features will be used to model the interaction relationship between the camera wearer and the interactor.
S2, modeling a dual interaction relationship;
s2.1, a long-short-time memory module for interactive modeling;
the interaction behavior of the first person viewing angle involves interaction between the camera wearer and the interactor, so that the recognition effect is poor by only using the individual behavior features of the camera wearer and the interactor. In order to model the interaction relation between the two, the invention provides a long-time and short-time memory module for interaction modeling. The individual behavior characteristics of the camera wearer and the interactor are respectively input into corresponding Long Short-Term Memory modules (LSTM), the two modules are dual modules, and a symmetrical updating mode is adopted:
[i_t; o_t; g_t; a_t] = σ(W f_t + U F_{t-1} + J_{t-1} + b)
J_{t-1} = V F_{t-1}^*
c_t = i_t a_t + g_t c_{t-1}
F_t = o_t tanh(c_t)
where i_t, o_t, g_t, a_t are respectively the input gate, output gate, forget gate and input feature of the long-short-time memory module, σ is the sigmoid nonlinear function, φ is the linear rectification function (Rectified Linear Unit, ReLU), f_t is the individual behavior feature of the camera wearer or the interactor, i.e. f_t^{ego} or f_t^{exo} from section S1.4, F_t is the output feature of the corresponding long-short-time memory module, and F_t^* is the output feature of the dual module. That is, for the camera wearer, if the input feature of its long-short-time memory module is f_t = f_t^{ego} and its output feature is F_t, then F_t^* is the output feature of the long-short-time memory module corresponding to the interactor, and vice versa. W, U, V, b are the learnable parameters of the long-short-time memory module. For a conventional long-short-time memory module, the input at frame t comprises the feature of frame t and the output of the module at frame t-1; for the long-short-time memory module for interaction-relationship modeling, the input at frame t comprises not only the feature of frame t and the module's own output at frame t-1 but also the output of the dual module at frame t-1, which enables the module to model the interaction relationship between the camera wearer and the interactor. The model structure of the long-short-time memory module for interaction modeling is shown in Fig. 3.
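To make the update rule concrete, the sketch below implements one time step of the two dual cells. The coupling term J_{t-1} = V F_{t-1}^*, the input feature dimension and the hidden size are assumptions, not values given in the patent.

# Sketch of the dual interaction LSTM: each stream (camera wearer / interactor)
# updates from its own feature f_t, its previous output F_{t-1}, and the previous
# output of the dual stream F_{t-1}^*.
import torch
import torch.nn as nn

class InteractionLSTMCell(nn.Module):
    def __init__(self, in_dim, hid_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, 4 * hid_dim)                 # acts on f_t (includes bias b)
        self.U = nn.Linear(hid_dim, 4 * hid_dim, bias=False)    # acts on own output F_{t-1}
        self.V = nn.Linear(hid_dim, 4 * hid_dim, bias=False)    # acts on dual output F_{t-1}^*

    def forward(self, f_t, F_prev, c_prev, F_dual_prev):
        gates = torch.sigmoid(self.W(f_t) + self.U(F_prev) + self.V(F_dual_prev))
        i_t, o_t, g_t, a_t = gates.chunk(4, dim=-1)             # input, output, forget gates and input feature
        c_t = i_t * a_t + g_t * c_prev                          # intermediate (cell) feature
        F_t = o_t * torch.tanh(c_t)                             # output feature
        return F_t, c_t

# One symmetric (dual) update step for both streams.
hid = 128
ego_cell, exo_cell = InteractionLSTMCell(4096, hid), InteractionLSTMCell(4096, hid)
f_ego, f_exo = torch.randn(2, 4096), torch.randn(2, 4096)
F_ego = F_exo = torch.zeros(2, hid)
c_ego = c_exo = torch.zeros(2, hid)
F_ego_new, c_ego = ego_cell(f_ego, F_ego, c_ego, F_exo)   # wearer stream uses interactor output
F_exo_new, c_exo = exo_cell(f_exo, F_exo, c_exo, F_ego)   # interactor stream uses wearer output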
S2.2, identifying interaction behaviors;
the long-short-time memory module for interactive modeling explicitly models the interaction relationship between a camera wearer and an interactor by taking the output of the dual module from the last frame as the input of the current frame. Finally, in order to identify the interaction behavior, the outputs of the two long-short-time memory modules in the last frame N of the video are added point by point, and the fused characteristics are obtained through nonlinear operation:
R_N = φ(F_N + F_N^*)
A linear classifier is added on R_N, and the probability corresponding to each behavior category is obtained through a softmax function:
p(y|R_N) = softmax(W R_N + b)
optimizing the classification result using a cross entropy loss function:
L_cls = − Σ_{k=1}^{K} y_k log p(y_k | R_N)
where y_k is the label of category k, i.e. y_k = 1 if the behavior category index is k and y_k = 0 otherwise; K is the total number of categories.
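A sketch of this fusion and classification step is given below; ReLU is assumed as the nonlinearity φ, and the hidden size and number of behavior categories are placeholder values.

# Sketch of interaction-behavior classification: fuse the last-frame outputs of the
# two LSTM streams, apply a linear classifier, and train with cross-entropy.
import torch
import torch.nn as nn
import torch.nn.functional as F

class InteractionClassifier(nn.Module):
    def __init__(self, hid_dim=128, num_classes=8):
        super().__init__()
        self.fc = nn.Linear(hid_dim, num_classes)

    def forward(self, F_ego_N, F_exo_N):
        R_N = torch.relu(F_ego_N + F_exo_N)      # point-wise sum + nonlinearity (ReLU assumed)
        return self.fc(R_N)                      # linear classifier W R_N + b

clf = InteractionClassifier()
logits = clf(torch.randn(2, 128), torch.randn(2, 128))
labels = torch.tensor([3, 5])
L_cls = F.cross_entropy(logits, labels)          # cross-entropy over the softmax probabilities
print(L_cls.item())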
S2.3, overall learning of the model
The above defines the human body analysis loss L_seg used to assist the learning of the attention module, the reconstruction loss L_rec used for motion feature learning, and the loss L_cls used for interaction behavior classification. The overall loss function of the model is a weighted sum of these three loss functions:
L = L_cls + α L_seg + β L_rec
where α and β are the weights of L_seg and L_rec respectively; the proposed overall model performs end-to-end learning based on this loss function.
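For illustration, the weighted combination can be written as follows; the placeholder loss values and the weights α and β are assumptions, since the patent does not specify them.

# Standalone sketch of the overall objective L = L_cls + alpha*L_seg + beta*L_rec.
import torch

L_cls = torch.tensor(1.2, requires_grad=True)   # stands in for the classification loss
L_seg = torch.tensor(0.4, requires_grad=True)   # stands in for the human body segmentation loss
L_rec = torch.tensor(0.7, requires_grad=True)   # stands in for the reconstruction loss
alpha, beta = 1.0, 0.1                          # assumed weighting hyperparameters
L = L_cls + alpha * L_seg + beta * L_rec
L.backward()                                     # gradients flow to all three terms for end-to-end learning
print(L.item())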
Current first-person-view interaction behavior recognition based on deep learning mainly treats the interaction behavior between the camera wearer and the interactor as a whole and learns the appearance and motion features related to the interaction behavior. However, the interaction behavior involves the interplay between two people who have different appearance and motion information, and the appearance and motion information of each person together with the interaction relationship between them jointly determine the category of the interaction behavior. Describing the interaction behavior directly as a whole therefore cannot effectively express the interaction relationship. In the scheme provided by the invention, the attention module and the motion module are first used to learn the appearance and motion features corresponding to the camera wearer and the interactor respectively, and the interaction relationship between them is then explicitly modeled through the long-short-time memory module, giving a description of the interaction behavior. This modeling approach is more consistent with the characteristics of first-person-view interaction behavior, so the method provided by the invention can better describe first-person-view interaction behavior and thereby help interaction behavior recognition.
The above embodiment is a preferred embodiment of the present invention, but the embodiments of the present invention are not limited thereto; any other change, modification, substitution, combination or simplification made without departing from the spirit and principle of the present invention shall be an equivalent replacement and is included within the protection scope of the present invention.

Claims (5)

1. The first person view video interactive behavior identification method based on interactive modeling is characterized by comprising the following steps of:
s1, explicitly separating a camera wearer and an interactor, respectively learning behavior characteristics of the camera wearer and the interactor, and comprising the following steps:
s1.1, separating an interactor from the background through an attention module;
s1.2, respectively extracting and learning the behavior features of the camera wearer and the interactor, including static appearance features and dynamic motion features; the static appearance feature is the feature of the static visual content seen by the camera wearer, namely the global appearance feature of the video frame I_t corresponding to the camera wearer and the local appearance feature of the video frame I_t corresponding to the interactor;
s1.3, motion feature learning: for the camera wearer, its motion information is the camera motion information, and the influence of this motion information on video frame changes is global; for the interactor, the influence of its motion information on video frame changes is local, and is represented by a dense motion matrix D ∈ R^{H×W}, which is multiplied point by point with the mask M_t^{(3)} generated by the attention module so that the motion matrix D acts only on the interactor and not on the background;
s1.4, for each pair of adjacent video frames I_{t-1}, I_t, the global static appearance feature f_t^{g,a} and motion feature f_t^{g,m} corresponding to the camera wearer and the local static appearance feature f_t^{l,a} and motion feature f_t^{l,m} corresponding to the interactor are obtained through the attention module and the motion module respectively; the behavior feature of the camera wearer is defined as f_t^{ego} = [f_t^{g,a}, f_t^{g,m}] and that of the interactor as f_t^{exo} = [f_t^{l,a}, f_t^{l,m}], and these two features model the interaction relationship between the camera wearer and the interactor;
s2, modeling a dual interaction relationship;
s2.1, constructing a long-short-time memory module for interactive modeling; in step S2.1, the long-short-time memory module for interactive modeling is specifically constructed as follows:
the individual behavior characteristics of the camera wearer and the interactor are respectively input into corresponding long-short-time memory modules, the two modules are dual modules, and a symmetrical updating mode is adopted:
[i_t; o_t; g_t; a_t] = σ(W f_t + U F_{t-1} + J_{t-1} + b)
J_{t-1} = V F_{t-1}^*
c_t = i_t a_t + g_t c_{t-1}
F_t = o_t tanh(c_t)
where i_t, o_t, g_t, a_t are respectively the input gate, output gate, forget gate and input feature of the long-short-time memory module, σ is the sigmoid nonlinear function, φ is the linear rectification function, f_t is the individual behavior feature of the camera wearer or the interactor, c_t is the intermediate feature of the long-short-time memory module, F_t is the output feature of the corresponding long-short-time memory module, F_t^* is the output feature of the dual module, and V, b are learnable parameters of the long-short-time memory module;
s2.2, the long-short-time memory module for interactive modeling explicitly models the interaction relationship between the camera wearer and the interactor by taking the output of the dual module at the previous frame as an input of the current frame; in step S2.2, the method further comprises the following steps:
the outputs of the two long-short-time memory modules at the last frame N of the video are added point by point, and the fused feature is obtained through a nonlinear operation:
R_N = φ(F_N + F_N^*)
a linear classifier is added on R_N, and the probability corresponding to each behavior category is obtained through a softmax function:
p(y|R_N) = softmax(W R_N + b)
optimizing the classification result using a cross entropy loss function:
L_cls = − Σ_{k=1}^{K} y_k log p(y_k | R_N)
where y_k is the label of category k, i.e. y_k = 1 if the behavior category index is k and y_k = 0 otherwise; K is the total number of categories.
2. The method for identifying interaction behavior of a video of a first person viewing angle based on interaction modeling according to claim 1, wherein in step S1.1, the attention module is specifically:
for two adjacent frames I_{t-1}, I_t ∈ R^{H×W×3} of the video, where t is the frame index, H and W are respectively the height and width of the video frame, and 3 is the number of channels of the video frame, representing the three RGB channels, a deep convolutional neural network is used to extract features from each frame, obtaining the visual features f_{t-1}, f_t ∈ R^{H0×W0×C} corresponding to the two frames, where H0 and W0 are respectively the height and width of the feature map and C is the number of feature channels; these features extract the static appearance information in the video; the attention module adds a series of deconvolution layers on the visual feature f_t to obtain a group of masks of different sizes M_t^{(0)}, M_t^{(1)}, M_t^{(2)}, M_t^{(3)}, where the size of M_t^{(0)} is H0×W0, i.e. equal to that of the feature map f_t, and the size of M_t^{(3)} is H×W, i.e. equal to that of the video frame I_t; M_t^{(0)} is used to separate the appearance features of the interactor from the feature f_t, and M_t^{(3)} is used in the subsequent motion module.
3. The method for identifying interaction behavior of a video of a first person viewing angle based on interaction modeling according to claim 2, wherein in step S1.2, the static appearance features are extracted by:
based on the mask M_t^{(0)} generated by the attention module, the static appearance features of the camera wearer and the interactor are extracted respectively; the camera wearer is not visible in the first-person video, so its static appearance feature is defined as the feature of the static visual content seen by the camera wearer, namely the global appearance feature of the video frame I_t:
f_t^{g,a} = (1 / (H0 · W0)) Σ_{i,j} f_t(i, j)
where i, j are the two-dimensional coordinate indices on the feature map f_t, and the global appearance feature f_t^{g,a} is a vector of dimension C;
for the interactor, the mask M_t^{(0)} generated by the attention module gives the spatial position information of the interactor, and the static appearance feature of the interactor can be obtained from the appearance feature f_t of the video frame and the mask M_t^{(0)} as:
f_t^{l,a} = Σ_{i,j} M_t^{(0)}(i, j) f_t(i, j) / Σ_{i,j} M_t^{(0)}(i, j)
4. the method for identifying interaction behavior of a video of a first person viewing angle based on interaction modeling according to claim 1, wherein in step S1.3, the motion characteristics are:
for the camera wearer, a global transformation matrix T ∈ R^{3×3} is used to describe the global motion information; each pixel position (x, y) in the video frame is extended to a three-dimensional vector p = (x, y, 1)^T and then transformed as T·p with the global transformation matrix T, the transformation result being equivalent to rotating, scaling and translating the video frame;
for the interactor, the influence of its motion information on video frame changes is local and is represented by a dense motion matrix D ∈ R^{H×W}, which is multiplied point by point with the mask M_t^{(3)} generated by the attention module so that the motion matrix D acts only on the interactor and not on the background; given the global transformation matrix T, the local motion matrix D and the coordinate matrix X_{t-1} of frame t-1, the coordinate matrix X_t of frame t is predicted as:
X_t = T · X_{t-1} + M_t^{(3)} ⊙ D
using the predicted coordinate matrix X_t of frame t, the video frame Î_t of frame t is reconstructed by interpolation, and T and D are learned through the reconstruction loss between the reconstructed video frame and the real video frame:
L_rec = Σ_x ‖ Î_t(x) − I_t(x) ‖
where x is the index of each two-dimensional spatial coordinate.
5. The method for identifying interaction behavior of a video of a first person viewing angle based on interaction modeling according to claim 4, wherein the global transformation matrix T and the local motion matrix D are generated by a motion module; the motion module takes the visual features f_{t-1}, f_t of frame t-1 and frame t as input; f_{t-1} and f_t are compared multiplicatively to compute their correlation, and the result is concatenated with f_{t-1} and f_t along the channel dimension to obtain a new feature map; a convolution layer is added on this feature map, and the result is fed into two branches: one branch is globally pooled to obtain the global motion feature f_t^{g,m} corresponding to the camera wearer, and a fully connected layer is added on this global feature to obtain the global transformation matrix T; the other branch is multiplied point by point with the mask M_t^{(0)} generated by the attention module to obtain a feature, to which three deconvolution layers and one convolution layer are added to obtain the local motion matrix D matching the size of the video frame, and the feature before the deconvolution layers is globally pooled to obtain the local motion feature f_t^{l,m} corresponding to the interactor.
CN202010009544.XA 2020-01-06 2020-01-06 First person view video interactive behavior identification method based on interactive modeling Active CN111241963B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010009544.XA CN111241963B (en) 2020-01-06 2020-01-06 First person view video interactive behavior identification method based on interactive modeling

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010009544.XA CN111241963B (en) 2020-01-06 2020-01-06 First person view video interactive behavior identification method based on interactive modeling

Publications (2)

Publication Number Publication Date
CN111241963A CN111241963A (en) 2020-06-05
CN111241963B true CN111241963B (en) 2023-07-14

Family

ID=70874282

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010009544.XA Active CN111241963B (en) 2020-01-06 2020-01-06 First person view video interactive behavior identification method based on interactive modeling

Country Status (1)

Country Link
CN (1) CN111241963B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112464875A (en) * 2020-12-09 2021-03-09 南京大学 Method and device for detecting human-object interaction relationship in video
CN112686194B (en) * 2021-01-06 2023-07-18 中山大学 First person visual angle action recognition method, system and storage medium
CN113569756B (en) * 2021-07-29 2023-06-09 西安交通大学 Abnormal behavior detection and positioning method, system, terminal equipment and readable storage medium
CN114581487B (en) * 2021-08-02 2022-11-25 北京易航远智科技有限公司 Pedestrian trajectory prediction method, device, electronic equipment and computer program product
CN115082840B (en) * 2022-08-16 2022-11-15 之江实验室 Action video classification method and device based on data combination and channel correlation

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106227341A (en) * 2016-07-20 2016-12-14 南京邮电大学 Unmanned plane gesture interaction method based on degree of depth study and system
CN106845351A (en) * 2016-05-13 2017-06-13 苏州大学 It is a kind of for Activity recognition method of the video based on two-way length mnemon in short-term
CN107515674A (en) * 2017-08-08 2017-12-26 山东科技大学 It is a kind of that implementation method is interacted based on virtual reality more with the mining processes of augmented reality
CN109241834A (en) * 2018-07-27 2019-01-18 中山大学 A kind of group behavior recognition methods of the insertion based on hidden variable

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106845351A (en) * 2016-05-13 2017-06-13 苏州大学 It is a kind of for Activity recognition method of the video based on two-way length mnemon in short-term
CN106227341A (en) * 2016-07-20 2016-12-14 南京邮电大学 Unmanned plane gesture interaction method based on degree of depth study and system
CN107515674A (en) * 2017-08-08 2017-12-26 山东科技大学 It is a kind of that implementation method is interacted based on virtual reality more with the mining processes of augmented reality
CN109241834A (en) * 2018-07-27 2019-01-18 中山大学 A kind of group behavior recognition methods of the insertion based on hidden variable

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Interactive volumetric three-dimensional display based on dynamic gesture control; Pan Wenping et al.; Opto-Electronic Engineering; 2010-12-31 (No. 12); pp. 88-95 *

Also Published As

Publication number Publication date
CN111241963A (en) 2020-06-05

Similar Documents

Publication Publication Date Title
CN111241963B (en) First person view video interactive behavior identification method based on interactive modeling
CN111339903B (en) Multi-person human body posture estimation method
CN107341452B (en) Human behavior identification method based on quaternion space-time convolution neural network
CN112926396B (en) Action identification method based on double-current convolution attention
CN109829427B (en) Face clustering method based on purity detection and spatial attention network
CN109948475B (en) Human body action recognition method based on skeleton features and deep learning
Zhang et al. Deep hierarchical guidance and regularization learning for end-to-end depth estimation
CN112800903B (en) Dynamic expression recognition method and system based on space-time diagram convolutional neural network
CN112307995B (en) Semi-supervised pedestrian re-identification method based on feature decoupling learning
Deng et al. MVF-Net: A multi-view fusion network for event-based object classification
CN111814719A (en) Skeleton behavior identification method based on 3D space-time diagram convolution
CN111368846B (en) Road ponding identification method based on boundary semantic segmentation
CN113344806A (en) Image defogging method and system based on global feature fusion attention network
JP6842745B2 (en) Image discrimination device and image discrimination method
CN112651940B (en) Collaborative visual saliency detection method based on dual-encoder generation type countermeasure network
CN112991350A (en) RGB-T image semantic segmentation method based on modal difference reduction
CN113128424A (en) Attention mechanism-based graph convolution neural network action identification method
Grigorev et al. Depth estimation from single monocular images using deep hybrid network
CN112581409A (en) Image defogging method based on end-to-end multiple information distillation network
Wang et al. AAGAN: enhanced single image dehazing with attention-to-attention generative adversarial network
Yin et al. Graph-based normalizing flow for human motion generation and reconstruction
CN112183464A (en) Video pedestrian identification method based on deep neural network and graph convolution network
CN114724251A (en) Old people behavior identification method based on skeleton sequence under infrared video
CN116797787B (en) Remote sensing image semantic segmentation method based on cross-modal fusion and graph neural network
CN111027433A (en) Multiple style face characteristic point detection method based on convolutional neural network

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant