CN114821772A - Weakly supervised temporal action detection method based on spatio-temporal correlation learning - Google Patents

Weakly supervised temporal action detection method based on spatio-temporal correlation learning

Info

Publication number
CN114821772A
Authority
CN
China
Prior art keywords
action
video
background
pooling
time
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202210383307.9A
Other languages
Chinese (zh)
Inventor
夏惠芬
詹永照
朱斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changzhou Vocational Institute of Mechatronic Technology
Original Assignee
Changzhou Vocational Institute of Mechatronic Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changzhou Vocational Institute of Mechatronic Technology
Priority to CN202210383307.9A
Publication of CN114821772A
Legal status: Withdrawn

Classifications

    • G06F 18/2415 — Pattern recognition; classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06N 3/045 — Computing arrangements based on biological models; neural networks; combinations of networks
    • G06N 3/047 — Probabilistic or stochastic networks
    • G06N 3/08 — Neural networks; learning methods

Abstract

The invention relates to the technical field of computer vision, in particular to a weakly supervised temporal action detection method based on spatio-temporal correlation learning, which comprises the following steps: S1, extracting features of video frames through an I3D network; S2, constructing a dynamic spatial graph network structure for the video to obtain the video spatial features; S3, constructing a one-dimensional temporal convolution network to obtain the video temporal features; S4, fusing the temporal and spatial features; S5, applying an action-background attention mechanism, i.e. action attention and background attention, to pool the original video features separately; S6, predicting the spatio-temporally correlated class activation sequence of actions and background in the video as well as the action or background activation sequences, and obtaining three classification losses; S7, calculating the total loss function; and S8, using the trained model for action detection. The invention solves the problem that action instances detected by existing weakly supervised temporal action detection methods are incomplete and inaccurate.

Description

Weakly supervised temporal action detection method based on spatio-temporal correlation learning
Technical Field
The invention relates to the technical field of computer vision, and in particular to a weakly supervised temporal action detection method based on spatio-temporal correlation learning.
Background
Temporal action detection in video is an important research topic in computer vision and multimedia, with wide applications in autonomous driving, human-computer interaction, patient monitoring and other fields. The task is to analyze and understand videos of real scenes: it aims to detect the categories of the actions performed by people in a video together with their start and end times, so that a computer, rather than a human, detects the activities present in the video and the cost of expensive manpower and material resources is reduced. Existing temporal action detection methods fall into three categories: fully supervised, unsupervised and weakly supervised. Fully supervised temporal action detection requires frame-level annotation, which is costly, time-consuming, labor-intensive and highly subjective; unsupervised models are trained without any labels according to certain priors, so their development has been relatively slow. Therefore, weakly supervised temporal action detection using only video-level labels has become increasingly popular among researchers.
In the training stage, a weakly supervised temporal action detection method is given only video-level label information. Its working principle is to use the video-level classification labels to guide segment-level classification prediction, and then to localize action instances by thresholding. Existing methods mainly adopt multiple-instance learning or an action attention mechanism and have achieved some progress, but their performance is still inferior to that of fully supervised methods. Analyzing the reasons, we find that most existing methods only model a single video segment or a single dimension of the video segments, and do not consider the spatio-temporal correlation of actions or the joint modeling of actions and background. In addition, because frame-level annotations are missing, we do not know where the actions in a video start and end; an action is produced by a continuous process; actions of the same category have high spatio-temporal correlation, and action segments differ from background segments to a certain extent. Therefore, correlation learning between action segments and accurate separation of actions from background are two key issues in weakly supervised temporal action detection.
The invention learns the spatial similarity relation and the temporal continuity relation of actions in video segments by constructing a multi-instance graph convolution network and a one-dimensional temporal convolution network, and then obtains more discriminative spatio-temporal correlation features by feature fusion. Furthermore, a joint action-background attention mechanism is introduced to construct a three-branch classification network that explicitly models actions and background. In particular, the base branch is used to predict class activation scores for actions and background, and the two pooling branches are used for the activation of actions or background, respectively. The method provides discriminative representations, distinguishes action segments from background segments and better separates actions from the background, thereby achieving more reliable action localization and improving localization performance.
Disclosure of Invention
The problems in the prior art are as follows: existing weakly supervised temporal action detection methods produce incomplete and inaccurate action instances; in particular, insufficient and unreasonable representation of spatio-temporal correlation features and of the distinction between actions and background limits improvements in action classification and localization performance.
The technical solution adopted by the invention is as follows: a weakly supervised temporal action detection method based on spatio-temporal correlation learning comprises the following steps:
S1, inputting a video frame sequence and extracting features of the video frames through an I3D network;
RGB features and optical flow features are generated and concatenated to obtain the video features.
S2, constructing a dynamic spatial graph network structure for the video and learning the spatial similarity relation between video frames to obtain the video spatial features;
further, each frame sampled from the video is taken as a node of the graph network structure, and the similarity between nodes is taken as the weight of an edge; the relation between frames is measured with a cosine similarity function to obtain the adjacency matrix of the spatial graph, a threshold is set, and edges with weights smaller than the threshold are discarded;
the neighboring matrix of the spatial map is represented as follows:
Figure BDA0003593835390000031
wherein x is i ,x j Is a feature of video frame i, j, i, j ═ 1,2, …, T;
setting a threshold δ for discarding edges with weights less than the threshold reduces the complexity of the graph convolution network, and the formula is as follows:
Figure BDA0003593835390000032
based on matrix A ij Constructing edges of graph, initial feature X (0) X is the input to the first layer, and the transform is performed by performing a graph convolution operation, the transform formula being:
Figure BDA0003593835390000033
wherein l ≧ 1 denotes the number of layers of the graph convolution network,
Figure BDA0003593835390000034
is the output of the last layer of the graph convolution network, W (l-1) Is the weight matrix that needs to be learned,
Figure BDA0003593835390000035
is that
Figure BDA0003593835390000036
The degree of regularization of the laplacian of (c),
Figure BDA0003593835390000037
is that
Figure BDA0003593835390000038
The degree matrix of (a) is,
Figure BDA0003593835390000039
representing a matrix with self-circulation, ReLU is the activation function.
S3, constructing a one-dimensional temporal convolution network and learning the temporal continuity relation between video frames to obtain the video temporal features;
further, the one-dimensional temporal convolution network aggregates the features of temporally adjacent segments to obtain the updated temporal features:

X_T = \Phi_t(X, W_t)    (4)

where \Phi_t denotes the one-dimensional temporal convolution network and W_t are the weight parameters to be learned.
S4, fusing the temporal and spatial features to obtain more discriminative spatio-temporal correlation fusion features;
further, the fusion of the spatial and temporal features is expressed by the following formula:

X_{Fusion} = f_{Fusion}(X, W)    (5)

where X_{Fusion} \in \mathbb{R}^{T \times E}, E is the dimension of the spatio-temporal correlation feature, and W = \{W^{(l-1)}, W_t\} is the parameter set of the graph convolution network and the one-dimensional convolution network.
S5, generating spatio-temporally correlated action pooling features and background pooling features using an action-background joint attention mechanism;
further, the action-background joint attention mechanism consists of two 1-D convolution networks followed by a softmax layer, and outputs a T \times 2 attention matrix:

A_{att} = \Phi_{att}(X, W_{att})    (6)

where A_{att} \in \mathbb{R}^{T \times 2}, \Phi_{att} is a filtering mechanism composed of several convolution layers, and W_{att} are the parameters to be learned;
to generate attention for actions and background, the softmax activation function is applied along the second dimension of the T \times 2 matrix, yielding the probability a_i^t of action or background:

a_i^t = \frac{\exp(A_{att}^{t,i})}{\sum_{j \in \{act,bkg\}} \exp(A_{att}^{t,j})}    (7)

where i \in \{act, bkg\}, a_i^t denotes the probability that the t-th frame of the video is action or background, and a_{act} and a_{bkg} are the first and second columns of the action-background attention map, respectively;
using the corresponding attention weights, the action-pooled and background-pooled features are generated separately, as follows:

X_i = a_i \otimes X    (8)

where a_i \in \mathbb{R}^{T \times 1} is the action or background attention and \otimes denotes element-wise multiplication along the temporal dimension;
the correlated features of action or background pooling are represented by the following function:

X_{i\_Fusion} = f_{Fusion}(X_i, W)    (9).
S6, constructing a three-branch classification network, predicting the spatio-temporally correlated class activation sequence of actions and background in the video with the base branch, predicting the action class activation sequence and the background class activation sequence with two pooling branches whose training objectives are opposite, applying Top-k averaging to the three class activation sequences to obtain three video-level class activation scores, and finally obtaining three classification loss functions with cross entropy;
further, the spatio-temporal correlation fusion features are fed into a classifier composed of fully connected layers, the frame-level class activation scores are aggregated along the temporal dimension by top-k averaging, and a binary cross-entropy loss function is compared with the Ground Truth to obtain the classification loss of the spatio-temporal correlation fusion features;
further, first, the formula of the classifier composed of fully connected layers applied to the spatio-temporal correlation fusion features is:

A_{base} = f_{cls}(X_{Fusion}, W_{cls})    (10)

where A_{base} \in \mathbb{R}^{T \times (C+1)}, C denotes the number of action classes, C+1 denotes the C action classes plus one background class, f_{cls} is the classifier, and W_{cls} are the parameters to be learned;
secondly, the frame-level class activation scores are aggregated along the temporal dimension by the top-k averaging method:

s^c = \max_{\Omega^c \subset A_{base}[:,c],\, |\Omega^c| = k} \frac{1}{k} \sum_{a \in \Omega^c} a, \qquad k = \lceil T/m \rceil    (11)

where a \in A_{base} are the values in the matrix and m is used to control the number of selected frames;
along the class dimension, the probability that the video belongs to each class is computed with the softmax activation function:

p^c = \frac{\exp(s^c)}{\sum_{c'=1}^{C+1} \exp(s^{c'})}    (12)

where c' denotes an action class;
finally, the binary cross-entropy loss function is compared with the Ground Truth to obtain the classification loss of the spatio-temporal correlation fusion features:

L_{base} = -\frac{1}{N} \sum_{n=1}^{N} \sum_{c=1}^{C+1} \hat{y}_n^c \log p_n^c    (13)

where N is the number of videos, \hat{y}_n = y_n / \sum_{c=1}^{C+1} y_n^c is the normalized video-level label, and y_n is the label vector of video v_n;
further, the action-pooled and background-pooled spatio-temporal features are fed into the action pooling branch and the background pooling branch, respectively, to obtain class activation sequences; the frame-level class activation scores are aggregated along the temporal dimension by top-k averaging; and the binary cross-entropy loss function is compared with the Ground Truth to obtain the classification losses in the action and background pooling branches;
the correlated features X_{i\_Fusion} pooled by action or background are sent into the action pooling branch and the background pooling branch of the three-branch classification network, respectively, to obtain the class activation sequences in these branches:

A_i = f_{cls}(X_{i\_Fusion}, W_{cls})    (14)

where f_{cls} is the classifier and W_{cls} are the shared parameters to be learned;
then the video-level class activation scores are obtained and softmax is applied to generate the class-wise activation probabilities; a binary cross-entropy function is likewise used as the classification loss:

L_i = -\frac{1}{N} \sum_{n=1}^{N} \sum_{c=1}^{C+1} \hat{y}_{i;n}^c \log p_{i;n}^c, \qquad i \in \{act, bkg\}    (15)

where N is the number of videos, \hat{y}_{i;n} is the normalized video-level label in the action or background pooling branch, \hat{y}_{act;n} is the label of video v_n in the action pooling branch, and \hat{y}_{bkg;n} is the label of video v_n in the background pooling branch;
S7, training the network model by combining the cross-entropy classification loss functions of the three-branch classification network and calculating the total loss value; the closer the loss value is to 0, the more accurate the model;
further, the total loss value is calculated by the following formula:

L_{cls} = L_{base} + L_{act} + L_{bkg}    (16)

where L_{base}, L_{act} and L_{bkg} are the classification loss of the base branch, the classification loss of the action pooling branch, and the classification loss of the background pooling branch, respectively.
S8, using the trained model for action detection: the class activation sequence generated by the action pooling branch of the model is used for the final action localization and classification; a classification threshold is set for action localization, a localization threshold is set, consecutive frames whose class activation scores are greater than or equal to the localization threshold are merged to form candidate action proposals, and action predictions with high overlap among the candidate proposals are removed by non-maximum suppression to obtain the action detection result.
The beneficial effects of the invention are:
1. The invention provides a novel weakly supervised temporal action detection method based on spatio-temporal correlation learning; by constructing a graph convolution network and a one-dimensional temporal convolution network, the spatial similarity relation and the temporal continuity relation between actions are learned separately, and more effective spatio-temporal correlation features are generated by fusion, providing a more discriminative representation for action classification and localization;
2. An attention mechanism combining actions and background is provided, and a three-branch classification network is adopted to model actions and background explicitly; the base branch predicts the class activation scores of actions and background and serves as positive samples of actions and background, while the two pooling branches predict the activation of actions or background and serve as negative samples of background or actions, respectively, so that the classification network can better distinguish actions from background and improve the accuracy of action localization.
Drawings
FIG. 1 is a flow chart of the weakly supervised temporal action detection method based on spatio-temporal correlation learning of the present invention;
FIG. 2 is a flow chart of the spatio-temporal correlation fusion learning of the present invention;
FIG. 3 is a flow chart of the three-branch classification network of the present invention.
Detailed Description
The invention will be further described with reference to the accompanying drawings and embodiments. The drawings are simplified schematic diagrams that illustrate only the basic structure of the invention, and therefore show only the structures relevant to the invention.
As shown in FIG. 1 and FIG. 2, a weakly supervised temporal action detection method based on spatio-temporal correlation learning includes the following steps:
s1 input video frame sequence
Figure BDA0003593835390000071
Where t is videoFrame number, T is the total number of frames in the video, v t Is the t-th frame in the video frame sequence number;
extracting features from video frames through I3D network to generate RGB features
Figure BDA0003593835390000081
And optical flow features
Figure BDA0003593835390000082
D is the dimension of the feature, the RGB feature and the optical flow feature are spliced, and finally the video feature is obtained
Figure BDA0003593835390000083
T is the sample length of the video.
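As a concrete illustration of step S1, the following minimal sketch assumes the RGB and optical-flow features have already been extracted by a pretrained I3D backbone; the tensor sizes are illustrative assumptions, not values from the patent.

```python
# Minimal sketch of S1: concatenate per-frame RGB and optical-flow I3D features.
import torch

def build_video_features(x_rgb: torch.Tensor, x_flow: torch.Tensor) -> torch.Tensor:
    """Concatenate T x D RGB and T x D flow features into T x 2D video features."""
    assert x_rgb.shape == x_flow.shape, "RGB and flow features must be aligned frame by frame"
    return torch.cat([x_rgb, x_flow], dim=-1)

# Example with hypothetical sizes (T = 750 sampled frames, D = 1024 per modality).
T, D = 750, 1024
x = build_video_features(torch.randn(T, D), torch.randn(T, D))  # -> shape (750, 2048)
```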
S2, construct a dynamic spatial graph network structure for the video and learn the spatial similarity relation between video frames to obtain the video spatial features X_S.
Each frame sampled from the video is taken as a node of the graph network structure, and the similarity between nodes is taken as the weight of an edge; the more similar the features of two nodes, the larger the weight of the edge between them and the closer the segments with similar features. The invention measures the relation between frames with a cosine similarity function, so the adjacency matrix of the spatial graph is expressed as:

A_{ij} = \frac{x_i^{\top} x_j}{\lVert x_i \rVert \, \lVert x_j \rVert}    (1)

where x_i, x_j are the features of video frames i and j, i, j = 1, 2, ..., T;
the threshold \delta is set to 0.75 to discard edges with weights smaller than the threshold, which reduces the complexity of the graph convolution network:

A_{ij} = \begin{cases} A_{ij}, & A_{ij} \geq \delta \\ 0, & A_{ij} < \delta \end{cases}    (2)
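A minimal sketch of the thresholded cosine-similarity adjacency of equations (1)-(2) follows, assuming x is the T x 2D video feature matrix from S1; the value delta = 0.75 comes from the text.

```python
# Sketch of equations (1)-(2): cosine-similarity adjacency with weak edges dropped.
import torch
import torch.nn.functional as F

def build_adjacency(x: torch.Tensor, delta: float = 0.75) -> torch.Tensor:
    x_norm = F.normalize(x, p=2, dim=-1)          # L2-normalize each frame feature
    adj = x_norm @ x_norm.t()                     # A_ij = cosine similarity of frames i and j
    return torch.where(adj >= delta, adj, torch.zeros_like(adj))  # discard edges below delta
```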
based on matrix A ij Constructing edges of graph, initial feature X (0) X being the first layer of the networkInputting, executing graph convolution operation for transformation, wherein the formula is as follows:
Figure BDA0003593835390000086
wherein l ≧ 1 denotes the number of layers of the graph convolution network,
Figure BDA0003593835390000087
is the output of the last layer of the graph convolution network, W (l-1) Is the weight matrix that needs to be learned,
Figure BDA0003593835390000088
is that
Figure BDA0003593835390000089
The degree of regularization of the laplacian of (c),
Figure BDA00035938353900000810
is that
Figure BDA00035938353900000811
The degree matrix of (a) is,
Figure BDA00035938353900000812
representing a matrix with self-circulation, ReLU is the activation function.
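The following is a minimal sketch of one graph-convolution layer implementing equation (3): self-loops are added, the adjacency is symmetrically normalized, and a learned linear map plus ReLU is applied. The layer width is an illustrative assumption.

```python
# Sketch of equation (3): normalized adjacency with self-loops, linear map, ReLU.
import torch
import torch.nn as nn

class GraphConvLayer(nn.Module):
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.weight = nn.Linear(in_dim, out_dim, bias=False)  # W^(l-1)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        a_hat = adj + torch.eye(adj.size(0), device=adj.device)        # A-hat with self-loops
        d_inv_sqrt = a_hat.sum(dim=1).clamp(min=1e-6).pow(-0.5)        # D-hat^(-1/2)
        norm_adj = d_inv_sqrt.unsqueeze(1) * a_hat * d_inv_sqrt.unsqueeze(0)
        return torch.relu(norm_adj @ self.weight(x))                   # equation (3)

# x_s = GraphConvLayer(2048, 2048)(x, build_adjacency(x))  # spatial features X_S
```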
S3, construct a one-dimensional temporal convolution network and learn the temporal continuity relation between video frames to obtain the temporal features X_T.
Because every video segment has unique forward and backward neighbours in the temporal dimension, the invention uses a one-dimensional temporal convolution network to aggregate the features of adjacent segments and obtain the updated temporal features:

X_T = \Phi_t(X, W_t)    (4)

where \Phi_t denotes the one-dimensional temporal convolution network and W_t are the weight parameters to be learned.
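A minimal sketch of the temporal convolution \Phi_t of equation (4) follows; the kernel size and the use of a single layer are assumptions for illustration.

```python
# Sketch of equation (4): 1-D temporal convolution aggregating neighbouring frames.
import torch
import torch.nn as nn

class TemporalConv(nn.Module):
    def __init__(self, dim: int, kernel_size: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (T, dim); Conv1d expects (batch, channels, time)
        return torch.relu(self.conv(x.t().unsqueeze(0))).squeeze(0).t()

# x_t = TemporalConv(2048)(x)  # temporal features X_T, shape (T, 2048)
```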
S4, fuse the spatial features and the temporal features with a fusion technique to obtain more discriminative spatio-temporal correlation fusion features X_{Fusion}.
For convenience of expression, the invention uses the following function to represent the generation of the spatio-temporal correlation fusion features:

X_{Fusion} = f_{Fusion}(X, W)    (5)

where X_{Fusion} \in \mathbb{R}^{T \times E}, E is the dimension of the spatio-temporal correlation feature, and W = \{W^{(l-1)}, W_t\} is the parameter set of the graph convolution network and the one-dimensional convolution network.
S5, generate the spatio-temporally correlated action pooling features X_{act} and background pooling features X_{bkg} of the video through the action-background joint attention mechanism.
The action-background joint attention mechanism consists of two 1-D convolution networks followed by a softmax layer, and its output is a T \times 2 attention matrix:

A_{att} = \Phi_{att}(X, W_{att})    (6)

where A_{att} \in \mathbb{R}^{T \times 2}, \Phi_{att} is a filtering mechanism composed of several convolution layers, and W_{att} are the parameters to be learned. To generate attention for actions and background, the softmax activation function is applied along the second dimension of the T \times 2 matrix; with i \in \{act, bkg\}, the formula is:

a_i^t = \frac{\exp(A_{att}^{t,i})}{\sum_{j \in \{act,bkg\}} \exp(A_{att}^{t,j})}    (7)

where a_i^t denotes the probability that the t-th frame of the video is action or background, and a_{act} and a_{bkg} are the first and second columns of the action-background attention map, respectively.
Next, using the corresponding attention weights, the action-pooled and background-pooled features are generated separately:

X_i = a_i \otimes X    (8)

where a_i \in \mathbb{R}^{T \times 1} is the action or background attention and \otimes denotes element-wise multiplication along the temporal dimension.
The invention uses the following function to represent the correlated features of action or background pooling:

X_{i\_Fusion} = f_{Fusion}(X_i, W)    (9).
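A minimal sketch of the action-background joint attention of equations (6)-(8) follows: two 1-D convolutions produce a T x 2 map, softmax over the two channels gives per-frame action/background probabilities, and each attention column re-weights the original features. The hidden width and kernel sizes are illustrative assumptions.

```python
# Sketch of equations (6)-(8): two-channel attention and weighted pooling of features.
import torch
import torch.nn as nn

class ActionBackgroundAttention(nn.Module):
    def __init__(self, dim: int, hidden: int = 512):
        super().__init__()
        self.att = nn.Sequential(
            nn.Conv1d(dim, hidden, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(hidden, 2, kernel_size=1),               # T x 2 attention logits
        )

    def forward(self, x: torch.Tensor):
        logits = self.att(x.t().unsqueeze(0)).squeeze(0).t()   # (T, 2)
        attn = torch.softmax(logits, dim=1)                    # equation (7)
        a_act, a_bkg = attn[:, 0:1], attn[:, 1:2]
        return a_act * x, a_bkg * x, attn                      # X_act, X_bkg (equation (8))
```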
s6, constructing a three-branch classification network, and predicting a spatio-temporal correlated class activation sequence A of the action and the background in the video by using the basic branches in the three-branch network as shown in figure 3 base (ii) a Method for predicting action class activation sequence A in video by using two opposite pooling branches of training targets act Or background class activation sequence A bkg
As shown in FIG. 3, the spatiotemporal associations are fused to a feature X Fusion Sending the data into a classifier consisting of full connection layers, wherein the expression is as follows:
Α base =f cls (X Fusion ,W cls ) (10)
wherein the content of the first and second substances,
Figure BDA0003593835390000101
c +1 represents C action classes plus a background class, f cls Is a classifier,W cls Is a parameter to be learned;
secondly, aggregating class activation scores of the frame level along the time dimension by adopting a top-k averaging method, wherein the formula is as follows:
Figure BDA0003593835390000102
wherein a is ∈ A and is A base The values in the matrix are then compared to each other,
Figure BDA0003593835390000103
m-8 is a hyper-parameter that controls the number of selected frames, and then along the category dimension, the probability that the video belongs to each category is calculated using the softmax activation function, as follows:
Figure BDA0003593835390000104
wherein the content of the first and second substances,
Figure BDA0003593835390000105
c' represents an action category;
and finally, comparing the binary cross entropy loss function with the Ground Truth to obtain a classification loss formula of the space-time association fusion characteristics:
Figure BDA0003593835390000106
wherein, N is the number of videos,
Figure BDA0003593835390000107
is a regularized video-level label,
Figure BDA0003593835390000108
and is provided with
Figure BDA0003593835390000109
Is a video v n The tag vector of (a);
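A minimal sketch of the base-branch scoring and loss of equations (10)-(13) follows, written for a single video; the classifier dimensions E and C are placeholders, and m = 8 follows the text.

```python
# Sketch of equations (10)-(13): frame-level classifier, top-k mean pooling over time,
# softmax over C+1 classes, and cross-entropy against the normalized video-level label.
import math
import torch
import torch.nn as nn

def video_level_loss(cas: torch.Tensor, y: torch.Tensor, m: int = 8) -> torch.Tensor:
    """cas: (T, C+1) class activation sequence; y: (C+1,) multi-hot video label."""
    T = cas.size(0)
    k = max(1, math.ceil(T / m))
    topk = cas.topk(k, dim=0).values.mean(dim=0)        # equation (11): top-k average per class
    p = torch.softmax(topk, dim=0)                      # equation (12)
    y_norm = y / y.sum().clamp(min=1e-6)                # normalized video-level label
    return -(y_norm * torch.log(p + 1e-6)).sum()        # equation (13), one video

# classifier = nn.Linear(E, C + 1)                      # f_cls with shared weights W_cls
# loss_base = video_level_loss(classifier(x_fusion), y)
```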
also, associated features X that are pooled for action or context i_Fusion Respectively sending the three-branch classification network into an action pooling branch and a background pooling branch to obtain class activation sequences in the branches, wherein the formula is as follows:
Α i =f cls (X i_Fusion ,W cls ) (14)
wherein, f cls Is a classifier, W cls Is a shared parameter that needs to be learned;
then, obtaining the class activation score of the video level, and executing softmax to generate class-by-class activation probability; also using a binary cross-entropy function as the classification penalty, the expression is as follows:
Figure BDA0003593835390000111
wherein, N is the number of videos,
Figure BDA0003593835390000112
is a regularized video level label in the action or background pooling branch,
Figure BDA0003593835390000113
is a video v n Tags in action pooling branches;
Figure BDA0003593835390000114
Figure BDA0003593835390000115
is a video v n Tags in the background pooling branch.
S7, train the network model by combining the cross-entropy classification loss functions of the three-branch classification network and calculate the total loss value:

L_{cls} = L_{base} + L_{act} + L_{bkg}    (16)
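A minimal sketch of one training step for equation (16) follows, assuming the three branch losses have been computed with the shared classifier on X_Fusion, X_act_Fusion and X_bkg_Fusion; the optimizer choice and learning rate are illustrative assumptions.

```python
# Sketch of equation (16): sum the three branch losses and take one optimizer step.
import torch

# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # hypothetical setup
def training_step(optimizer, loss_base, loss_act, loss_bkg):
    loss_cls = loss_base + loss_act + loss_bkg   # equation (16)
    optimizer.zero_grad()
    loss_cls.backward()
    optimizer.step()
    return loss_cls.item()
```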
s8, using the trained model for action detection, the invention uses the class activation sequence A generated by the action pooling branch in the model act For final action positioning and classification;
setting a classification threshold θ cls Leave the action category with classification score greater than the threshold for action localization 0.25;
setting a positioning threshold θ act To enrich the prediction proposal, θ act Set to multiple threshold position of [0, 0.25%]Step size is 0.025, and continuous frames with class activation scores larger than or equal to a threshold value are combined together to form a candidate action proposal;
and deleting the action prediction with higher overlapping degree in the candidate proposal by adopting a non-maximum suppression method (NMS) to obtain a final action detection result.
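A minimal sketch of the proposal generation and NMS of step S8 follows: frames whose activation for a kept class exceeds a localization threshold are grouped into consecutive runs, each run becomes a proposal scored here (for illustration) by its mean activation, and overlapping proposals are suppressed. The proposal scoring rule and the NMS IoU threshold are assumptions; the threshold values follow the text.

```python
# Sketch of step S8: thresholding into consecutive runs, then temporal NMS.
import numpy as np

def generate_proposals(scores: np.ndarray, theta_act: float):
    """scores: (T,) activation of one kept class; returns [(start, end, score), ...]."""
    keep = scores >= theta_act
    proposals, start = [], None
    for t, flag in enumerate(keep):
        if flag and start is None:
            start = t
        elif not flag and start is not None:
            proposals.append((start, t - 1, float(scores[start:t].mean())))
            start = None
    if start is not None:
        proposals.append((start, len(scores) - 1, float(scores[start:].mean())))
    return proposals

def nms(proposals, iou_thresh: float = 0.5):
    """Keep highest-scoring proposals, dropping those with temporal IoU above the threshold."""
    proposals = sorted(proposals, key=lambda p: p[2], reverse=True)
    kept = []
    for s, e, sc in proposals:
        suppressed = False
        for ks, ke, _ in kept:
            inter = max(0, min(e, ke) - max(s, ks) + 1)
            union = (e - s + 1) + (ke - ks + 1) - inter
            if inter / union > iou_thresh:
                suppressed = True
                break
        if not suppressed:
            kept.append((s, e, sc))
    return kept

# candidates = [p for th in np.arange(0.0, 0.25, 0.025) for p in generate_proposals(act_scores, th)]
# detections = nms(candidates)
```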
In light of the foregoing description of the preferred embodiment of the present invention, many modifications and variations will be apparent to those skilled in the art without departing from the spirit and scope of the invention. The technical scope of the present invention is not limited to the content of the specification, and must be determined according to the scope of the claims.

Claims (10)

1. A weakly supervised temporal action detection method based on spatio-temporal correlation learning, characterized by comprising the following steps:
S1, inputting a video frame sequence and extracting features of the video frames through an I3D network;
S2, constructing a dynamic spatial graph network structure for the video and learning the spatial similarity relation between video frames to obtain the video spatial features;
S3, constructing a one-dimensional temporal convolution network and learning the temporal continuity relation between video frames to obtain the video temporal features;
S4, fusing the temporal features and the spatial features to obtain more discriminative spatio-temporal correlation fusion features;
S5, generating spatio-temporally correlated action pooling features and background pooling features using an action-background attention mechanism;
S6, constructing a three-branch classification network, predicting the spatio-temporally correlated class activation sequence of actions and background in the video with the base branch, predicting the action class activation sequence and the background class activation sequence with two pooling branches whose training objectives are opposite, applying Top-k averaging to the three class activation sequences to obtain three video-level class activation scores, and finally obtaining three classification loss functions with cross entropy;
S7, training the network model by combining the cross-entropy classification loss functions of the three-branch classification network and calculating the total loss value;
S8, using the trained network model for action detection: the class activation sequence generated by the action pooling branch of the network model is used for the final action localization and classification; a classification threshold is set for action localization, a localization threshold is set, consecutive frames whose class activation scores are greater than or equal to the localization threshold are merged to form candidate action proposals, and action predictions with high overlap among the candidate proposals are removed by non-maximum suppression to obtain the action detection result.
2. The weakly supervised temporal action detection method based on spatio-temporal correlation learning of claim 1, wherein step S2 includes: taking each frame sampled from the video as a node of the graph network structure and the similarity between nodes as the weight of an edge; measuring the relation between frames with a cosine similarity function to obtain the adjacency matrix of the spatial graph, setting a threshold, and discarding edges with weights smaller than the threshold;
the adjacency matrix of the spatial graph is expressed as follows:

A_{ij} = \frac{x_i^{\top} x_j}{\lVert x_i \rVert \, \lVert x_j \rVert}    (1)

where x_i, x_j are the features of video frames i and j, i, j = 1, 2, ..., T;
a threshold \delta is set to discard edges with weights smaller than \delta, and the adjacency matrix of the spatial graph after discarding is:

A_{ij} = \begin{cases} A_{ij}, & A_{ij} \geq \delta \\ 0, & A_{ij} < \delta \end{cases}    (2)

the edges of the graph are constructed from the matrix A_{ij}, the initial feature X^{(0)} = X is the input of the first layer, and the transformation performed by the graph convolution operation is:

X^{(l)} = \mathrm{ReLU}\left( \hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}} X^{(l-1)} W^{(l-1)} \right)    (3)

where l \geq 1 denotes the layer index of the graph convolution network, X^{(l-1)} is the output of the previous layer, W^{(l-1)} is the weight matrix to be learned, \hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}} is the normalized Laplacian of \hat{A}, \hat{D} is the degree matrix of \hat{A}, \hat{A} denotes the adjacency matrix with self-loops, and ReLU is the activation function.
3. The weakly supervised temporal action detection method based on spatio-temporal correlation learning of claim 1, wherein step S3 includes: aggregating the features of adjacent segments with a one-dimensional temporal convolution network to obtain the updated temporal features:

X_T = \Phi_t(X, W_t)    (4)

where \Phi_t denotes the one-dimensional temporal convolution network and W_t are the weight parameters to be learned.
4. The weakly supervised temporal action detection method based on spatio-temporal correlation learning of claim 1, wherein the formula for fusing the temporal features and the spatial features is:

X_{Fusion} = f_{Fusion}(X, W)    (5)

where X_{Fusion} \in \mathbb{R}^{T \times E}, E is the dimension of the spatio-temporal correlation feature, and W = \{W^{(l-1)}, W_t\} is the parameter set of the graph convolution network and the one-dimensional convolution network.
5. The weakly supervised temporal action detection method based on spatio-temporal correlation learning of claim 1, wherein the action-background attention mechanism consists of two 1-D convolution networks followed by a softmax layer and outputs a T \times 2 attention matrix, expressed as:

A_{att} = \Phi_{att}(X, W_{att})    (6)

where A_{att} \in \mathbb{R}^{T \times 2}, \Phi_{att} is a filtering mechanism composed of several convolution layers, and W_{att} are the parameters to be learned;
the softmax activation function is applied along the second dimension of the T \times 2 matrix, yielding the probability a_i^t of action or background:

a_i^t = \frac{\exp(A_{att}^{t,i})}{\sum_{j \in \{act,bkg\}} \exp(A_{att}^{t,j})}    (7)

where i \in \{act, bkg\}, a_i^t denotes the probability that the t-th frame of the video is action or background, and a_{act} and a_{bkg} are the first and second columns of the action and background attention maps, respectively;
the action-pooled and background-pooled features are generated separately as:

X_i = a_i \otimes X    (8)

where a_i \in \mathbb{R}^{T \times 1} is the action or background attention and \otimes denotes element-wise multiplication along the temporal dimension;
the correlated features of action or background pooling are represented by formula (9):

X_{i\_Fusion} = f_{Fusion}(X_i, W)    (9).
6. The weakly supervised temporal action detection method based on spatio-temporal correlation learning of claim 1, wherein step S6 includes: feeding the spatio-temporal correlation fusion features into a classifier composed of fully connected layers, aggregating the frame-level class activation scores along the temporal dimension by top-k averaging, and comparing a binary cross-entropy loss function with the Ground Truth to obtain the classification loss of the spatio-temporal correlation fusion features.
7. The weakly supervised temporal action detection method based on spatio-temporal correlation learning of claim 6, wherein the formula for feeding the spatio-temporal correlation fusion features into the classifier composed of fully connected layers is:

A_{base} = f_{cls}(X_{Fusion}, W_{cls})    (10)

where A_{base} \in \mathbb{R}^{T \times (C+1)}, C denotes the number of action classes, C+1 denotes the C action classes plus one background class, f_{cls} is the classifier, and W_{cls} are the parameters to be learned;
the formula of the top-k averaging method for aggregating frame-level class activation scores along the temporal dimension is:

s^c = \max_{\Omega^c \subset A_{base}[:,c],\, |\Omega^c| = k} \frac{1}{k} \sum_{a \in \Omega^c} a, \qquad k = \lceil T/m \rceil    (11)

where a \in A_{base} are the values in the matrix and m is used to control the number of selected frames;
along the class dimension, the probability that the video belongs to each class is computed with the softmax activation function:

p^c = \frac{\exp(s^c)}{\sum_{c'=1}^{C+1} \exp(s^{c'})}    (12)

where c' denotes an action class;
the formula of the classification loss of the spatio-temporal correlation fusion features is:

L_{base} = -\frac{1}{N} \sum_{n=1}^{N} \sum_{c=1}^{C+1} \hat{y}_n^c \log p_n^c    (13)

where N is the number of videos, \hat{y}_n = y_n / \sum_{c=1}^{C+1} y_n^c is the normalized video-level label, and y_n is the label vector of video v_n.
8. The weakly supervised temporal action detection method based on spatio-temporal correlation learning of claim 1, wherein step S6 further includes: feeding the action-pooled and background-pooled spatio-temporal features into the action pooling branch and the background pooling branch, respectively, to obtain the action and background pooled class activation sequences; aggregating the frame-level class activation scores along the temporal dimension by top-k averaging; and comparing a binary cross-entropy loss function with the Ground Truth to obtain the classification losses in the action and background pooling branches.
9. The weakly supervised temporal action detection method based on spatio-temporal correlation learning of claim 8, wherein the formula of the action and background pooled class activation sequences is:

A_i = f_{cls}(X_{i\_Fusion}, W_{cls})    (14)

where f_{cls} is the classifier and W_{cls} are the shared parameters to be learned;
the formula for the classification losses in the action and background pooling branches is:

L_i = -\frac{1}{N} \sum_{n=1}^{N} \sum_{c=1}^{C+1} \hat{y}_{i;n}^c \log p_{i;n}^c    (15)

where i = \{act, bkg\}, N is the number of videos, \hat{y}_{i;n} is the normalized video-level label in the action or background pooling branch, \hat{y}_{act;n} is the label of video v_n in the action pooling branch, and \hat{y}_{bkg;n} is the label of video v_n in the background pooling branch.
10. The weakly supervised temporal action detection method based on spatio-temporal correlation learning of claim 1, wherein the total loss value is calculated as:

L_{cls} = L_{base} + L_{act} + L_{bkg}    (16)

where L_{base}, L_{act} and L_{bkg} are the classification loss of the base branch, the classification loss of the action pooling branch, and the classification loss of the background pooling branch, respectively.
CN202210383307.9A 2022-04-13 2022-04-13 Weakly supervised temporal action detection method based on spatio-temporal correlation learning Withdrawn CN114821772A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210383307.9A CN114821772A (en) Weakly supervised temporal action detection method based on spatio-temporal correlation learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210383307.9A CN114821772A (en) Weakly supervised temporal action detection method based on spatio-temporal correlation learning

Publications (1)

Publication Number Publication Date
CN114821772A true CN114821772A (en) 2022-07-29

Family

ID=82534624

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210383307.9A Withdrawn CN114821772A (en) 2022-04-13 2022-04-13 Weak supervision time sequence action detection method based on time-space correlation learning

Country Status (1)

Country Link
CN (1) CN114821772A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116030538A (en) * 2023-03-30 2023-04-28 中国科学技术大学 Weak supervision action detection method, system, equipment and storage medium



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20220729