CN114821772A - Weakly supervised temporal action detection method based on spatio-temporal correlation learning - Google Patents

Weakly supervised temporal action detection method based on spatio-temporal correlation learning

Info

Publication number
CN114821772A
Authority
CN
China
Prior art keywords
action
video
background
pooling
time
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202210383307.9A
Other languages
Chinese (zh)
Inventor
夏惠芬
詹永照
朱斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changzhou Vocational Institute of Mechatronic Technology
Original Assignee
Changzhou Vocational Institute of Mechatronic Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changzhou Vocational Institute of Mechatronic Technology
Priority to CN202210383307.9A
Publication of CN114821772A
Legal status: Withdrawn

Classifications

    • G06F 18/2415 — Pattern recognition; classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06N 3/045 — Computing arrangements based on biological models; neural networks; combinations of networks
    • G06N 3/047 — Probabilistic or stochastic networks
    • G06N 3/08 — Neural networks; learning methods

Abstract

The invention relates to the technical field of computer vision, in particular to a weakly supervised temporal action detection method based on spatio-temporal correlation learning, which comprises the following steps: S1, extracting features of video frames through an I3D network; S2, constructing a dynamic spatial graph network structure for the video to obtain the video spatial features; S3, constructing a one-dimensional temporal convolution network to obtain the video temporal features; S4, fusing the temporal and spatial features; S5, applying an action-background attention mechanism, i.e. action attention and background attention, to pool the original video features separately; S6, predicting the spatio-temporally correlated class activation sequence of actions and background in the video as well as the action or background activation sequences, and obtaining three classification losses; S7, calculating the total loss function; and S8, using the trained model for action detection. The invention solves the problem that action instances detected by existing weakly supervised temporal action detection methods are incomplete and inaccurate.

Description

Weakly supervised temporal action detection method based on spatio-temporal correlation learning
Technical Field
The invention relates to the technical field of computer vision, and in particular to a weakly supervised temporal action detection method based on spatio-temporal correlation learning.
Background
Temporal action detection in video is an important research topic in computer vision and multimedia, with wide applications in autonomous driving, human-computer interaction, patient monitoring and other fields. The task is to analyze and understand videos of real scenes: it aims to detect the categories of the actions performed by people in a video together with their start and end times, so that a computer, rather than a human, detects the activities present in the video and the cost of expensive manpower and material resources is reduced. Existing temporal action detection methods fall into three categories: fully supervised, unsupervised and weakly supervised. Fully supervised temporal action detection requires frame-level annotation, which is costly, time-consuming, labor-intensive and highly subjective; unsupervised models are trained without any labels according to certain priors, so their development has been relatively slow. Therefore, weakly supervised temporal action detection using only video-level labels has become increasingly popular among researchers.
In the training stage, a weakly supervised temporal action detection method is given only video-level label information. Its working principle is to use the video-level classification labels to guide segment-level classification prediction, and then to localize action instances by thresholding. Existing methods mainly adopt multiple-instance learning or an action attention mechanism and have achieved some progress, but their performance is still inferior to that of fully supervised methods. Analyzing the reasons, we find that most existing methods only model a single video segment or a single dimension of the video segments, and do not consider the spatio-temporal correlation of actions or the joint modeling of actions and background. In addition, because frame-level annotations are missing, we do not know where the actions in a video start and end; an action is produced by a continuous process; actions of the same category have high spatio-temporal correlation, and action segments differ from background segments to a certain extent. Therefore, correlation learning between action segments and accurate separation of actions from background are two key issues in weakly supervised temporal action detection.
The invention learns the spatial similarity relation and the temporal continuity relation of actions in video segments by constructing a multi-instance graph convolution network and a one-dimensional temporal convolution network, and then obtains more discriminative spatio-temporal correlation features by feature fusion. Furthermore, a joint action-background attention mechanism is introduced to construct a three-branch classification network that explicitly models actions and background. In particular, the base branch is used to predict class activation scores for actions and background, and the two pooling branches are used for the activation of actions or background, respectively. The method provides discriminative representations, distinguishes action segments from background segments and better separates actions from the background, thereby achieving more reliable action localization and improving localization performance.
Disclosure of Invention
The problems in the prior art are as follows: existing weakly supervised temporal action detection methods produce incomplete and inaccurate action instances; in particular, insufficient and unreasonable representation of spatio-temporal correlation features and of the distinction between actions and background limits improvements in action classification and localization performance.
The technical solution adopted by the invention is as follows: a weakly supervised temporal action detection method based on spatio-temporal correlation learning comprises the following steps:
S1, inputting a video frame sequence and extracting features of the video frames through an I3D network;
RGB features and optical flow features are generated and concatenated to obtain the video features.
S2, constructing a dynamic spatial graph network structure for the video and learning the spatial similarity relation between video frames to obtain the video spatial features;
further, each frame sampled from the video is taken as a node of the graph network structure, and the similarity between nodes is taken as the weight of an edge; the relation between frames is measured with a cosine similarity function to obtain the adjacency matrix of the spatial graph, a threshold is set, and edges with weights smaller than the threshold are discarded;
the neighboring matrix of the spatial map is represented as follows:
Figure BDA0003593835390000031
wherein x is i ,x j Is a feature of video frame i, j, i, j ═ 1,2, …, T;
setting a threshold δ for discarding edges with weights less than the threshold reduces the complexity of the graph convolution network, and the formula is as follows:
Figure BDA0003593835390000032
based on matrix A ij Constructing edges of graph, initial feature X (0) X is the input to the first layer, and the transform is performed by performing a graph convolution operation, the transform formula being:
Figure BDA0003593835390000033
wherein l ≧ 1 denotes the number of layers of the graph convolution network,
Figure BDA0003593835390000034
is the output of the last layer of the graph convolution network, W (l-1) Is the weight matrix that needs to be learned,
Figure BDA0003593835390000035
is that
Figure BDA0003593835390000036
The degree of regularization of the laplacian of (c),
Figure BDA0003593835390000037
is that
Figure BDA0003593835390000038
The degree matrix of (a) is,
Figure BDA0003593835390000039
representing a matrix with self-circulation, ReLU is the activation function.
S3, constructing a one-dimensional temporal convolution network and learning the temporal continuity relation between video frames to obtain the video temporal features;
further, the one-dimensional temporal convolution network aggregates the features of temporally adjacent segments to obtain the updated temporal features:

X_T = \Phi_t(X, W_t)    (4)

where \Phi_t denotes the one-dimensional temporal convolution network and W_t are the weight parameters to be learned.
S4, fusing the temporal and spatial features to obtain more discriminative spatio-temporal correlation fusion features;
further, the fusion of the spatial and temporal features is expressed by the following formula:

X_{Fusion} = f_{Fusion}(X, W)    (5)

where X_{Fusion} \in \mathbb{R}^{T \times E}, E is the dimension of the spatio-temporal correlation feature, and W = \{W^{(l-1)}, W_t\} is the parameter set of the graph convolution network and the one-dimensional convolution network.
S5, generating spatio-temporally correlated action pooling features and background pooling features using an action-background joint attention mechanism;
further, the action-background joint attention mechanism consists of two 1-D convolution networks followed by a softmax layer, and outputs a T \times 2 attention matrix:

A_{att} = \Phi_{att}(X, W_{att})    (6)

where A_{att} \in \mathbb{R}^{T \times 2}, \Phi_{att} is a filtering mechanism composed of several convolution layers, and W_{att} are the parameters to be learned;
to generate attention for actions and background, the softmax activation function is applied along the second dimension of the T \times 2 matrix, yielding the probability a_i^t of action or background:

a_i^t = \frac{\exp(A_{att}^{t,i})}{\sum_{j \in \{act,bkg\}} \exp(A_{att}^{t,j})}    (7)

where i \in \{act, bkg\}, a_i^t denotes the probability that the t-th frame of the video is action or background, and a_{act} and a_{bkg} are the first and second columns of the action-background attention map, respectively;
using the corresponding attention weights, the action-pooled and background-pooled features are generated separately, as follows:

X_i = a_i \otimes X    (8)

where a_i \in \mathbb{R}^{T \times 1} is the action or background attention and \otimes denotes element-wise multiplication along the temporal dimension;
the correlated features of action or background pooling are represented by the following function:

X_{i\_Fusion} = f_{Fusion}(X_i, W)    (9).
S6, constructing a three-branch classification network, predicting the spatio-temporally correlated class activation sequence of actions and background in the video with the base branch, predicting the action class activation sequence and the background class activation sequence with two pooling branches whose training objectives are opposite, applying Top-k averaging to the three class activation sequences to obtain three video-level class activation scores, and finally obtaining three classification loss functions with cross entropy;
further, the spatio-temporal correlation fusion features are fed into a classifier composed of fully connected layers, the frame-level class activation scores are aggregated along the temporal dimension by top-k averaging, and a binary cross-entropy loss function is compared with the Ground Truth to obtain the classification loss of the spatio-temporal correlation fusion features;
further, first, the formula of the classifier composed of fully connected layers applied to the spatio-temporal correlation fusion features is:

A_{base} = f_{cls}(X_{Fusion}, W_{cls})    (10)

where A_{base} \in \mathbb{R}^{T \times (C+1)}, C denotes the number of action classes, C+1 denotes the C action classes plus one background class, f_{cls} is the classifier, and W_{cls} are the parameters to be learned;
secondly, the frame-level class activation scores are aggregated along the temporal dimension by the top-k averaging method:

s^c = \max_{\Omega^c \subset A_{base}[:,c],\, |\Omega^c| = k} \frac{1}{k} \sum_{a \in \Omega^c} a, \qquad k = \lceil T/m \rceil    (11)

where a \in A_{base} are the values in the matrix and m is used to control the number of selected frames;
along the class dimension, the probability that the video belongs to each class is computed with the softmax activation function:

p^c = \frac{\exp(s^c)}{\sum_{c'=1}^{C+1} \exp(s^{c'})}    (12)

where c' denotes an action class;
finally, the binary cross-entropy loss function is compared with the Ground Truth to obtain the classification loss of the spatio-temporal correlation fusion features:

L_{base} = -\frac{1}{N} \sum_{n=1}^{N} \sum_{c=1}^{C+1} \hat{y}_n^c \log p_n^c    (13)

where N is the number of videos, \hat{y}_n = y_n / \sum_{c=1}^{C+1} y_n^c is the normalized video-level label, and y_n is the label vector of video v_n;
further, the action-pooled and background-pooled spatio-temporal features are fed into the action pooling branch and the background pooling branch, respectively, to obtain class activation sequences; the frame-level class activation scores are aggregated along the temporal dimension by top-k averaging; and the binary cross-entropy loss function is compared with the Ground Truth to obtain the classification losses in the action and background pooling branches;
the correlated features X_{i\_Fusion} pooled by action or background are sent into the action pooling branch and the background pooling branch of the three-branch classification network, respectively, to obtain the class activation sequences in these branches:

A_i = f_{cls}(X_{i\_Fusion}, W_{cls})    (14)

where f_{cls} is the classifier and W_{cls} are the shared parameters to be learned;
then the video-level class activation scores are obtained and softmax is applied to generate the class-wise activation probabilities; a binary cross-entropy function is likewise used as the classification loss:

L_i = -\frac{1}{N} \sum_{n=1}^{N} \sum_{c=1}^{C+1} \hat{y}_{i;n}^c \log p_{i;n}^c, \qquad i \in \{act, bkg\}    (15)

where N is the number of videos, \hat{y}_{i;n} is the normalized video-level label in the action or background pooling branch, \hat{y}_{act;n} is the label of video v_n in the action pooling branch, and \hat{y}_{bkg;n} is the label of video v_n in the background pooling branch;
S7, training the network model by combining the cross-entropy classification loss functions of the three-branch classification network and calculating the total loss value; the closer the loss value is to 0, the more accurate the model;
further, the total loss value is calculated by the following formula:

L_{cls} = L_{base} + L_{act} + L_{bkg}    (16)

where L_{base}, L_{act} and L_{bkg} are the classification loss of the base branch, the classification loss of the action pooling branch, and the classification loss of the background pooling branch, respectively.
S8, using the trained model for action detection: the class activation sequence generated by the action pooling branch of the model is used for the final action localization and classification; a classification threshold is set for action localization, a localization threshold is set, consecutive frames whose class activation scores are greater than or equal to the localization threshold are merged to form candidate action proposals, and action predictions with high overlap among the candidate proposals are removed by non-maximum suppression to obtain the action detection result.
The beneficial effects of the invention are:
1. The invention provides a novel weakly supervised temporal action detection method based on spatio-temporal correlation learning; by constructing a graph convolution network and a one-dimensional temporal convolution network, the spatial similarity relation and the temporal continuity relation between actions are learned separately, and more effective spatio-temporal correlation features are generated by fusion, providing a more discriminative representation for action classification and localization;
2. An attention mechanism combining actions and background is provided, and a three-branch classification network is adopted to model actions and background explicitly; the base branch predicts the class activation scores of actions and background and serves as positive samples of actions and background, while the two pooling branches predict the activation of actions or background and serve as negative samples of background or actions, respectively, so that the classification network can better distinguish actions from background and improve the accuracy of action localization.
Drawings
FIG. 1 is a flow chart of the weakly supervised temporal action detection method based on spatio-temporal correlation learning of the present invention;
FIG. 2 is a flow chart of the spatio-temporal correlation fusion learning of the present invention;
FIG. 3 is a flow chart of the three-branch classification network of the present invention.
Detailed Description
The invention will be further described with reference to the accompanying drawings and embodiments. The drawings are simplified schematic diagrams that illustrate only the basic structure of the invention, and therefore show only the structures relevant to the invention.
As shown in FIG. 1 and FIG. 2, a weakly supervised temporal action detection method based on spatio-temporal correlation learning includes the following steps:
s1 input video frame sequence
Figure BDA0003593835390000071
Where t is videoFrame number, T is the total number of frames in the video, v t Is the t-th frame in the video frame sequence number;
extracting features from video frames through I3D network to generate RGB features
Figure BDA0003593835390000081
And optical flow features
Figure BDA0003593835390000082
D is the dimension of the feature, the RGB feature and the optical flow feature are spliced, and finally the video feature is obtained
Figure BDA0003593835390000083
T is the sample length of the video.
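As a concrete illustration of step S1, the following minimal sketch assumes the RGB and optical-flow features have already been extracted by a pretrained I3D backbone; the tensor sizes are illustrative assumptions, not values from the patent.

```python
# Minimal sketch of S1: concatenate per-frame RGB and optical-flow I3D features.
import torch

def build_video_features(x_rgb: torch.Tensor, x_flow: torch.Tensor) -> torch.Tensor:
    """Concatenate T x D RGB and T x D flow features into T x 2D video features."""
    assert x_rgb.shape == x_flow.shape, "RGB and flow features must be aligned frame by frame"
    return torch.cat([x_rgb, x_flow], dim=-1)

# Example with hypothetical sizes (T = 750 sampled frames, D = 1024 per modality).
T, D = 750, 1024
x = build_video_features(torch.randn(T, D), torch.randn(T, D))  # -> shape (750, 2048)
```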
S2, construct a dynamic spatial graph network structure for the video and learn the spatial similarity relation between video frames to obtain the video spatial features X_S.
Each frame sampled from the video is taken as a node of the graph network structure, and the similarity between nodes is taken as the weight of an edge; the more similar the features of two nodes, the larger the weight of the edge between them and the closer the segments with similar features. The invention measures the relation between frames with a cosine similarity function, so the adjacency matrix of the spatial graph is expressed as:

A_{ij} = \frac{x_i^{\top} x_j}{\lVert x_i \rVert \, \lVert x_j \rVert}    (1)

where x_i, x_j are the features of video frames i and j, i, j = 1, 2, ..., T;
the threshold \delta is set to 0.75 to discard edges with weights smaller than the threshold, which reduces the complexity of the graph convolution network:

A_{ij} = \begin{cases} A_{ij}, & A_{ij} \geq \delta \\ 0, & A_{ij} < \delta \end{cases}    (2)
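A minimal sketch of the thresholded cosine-similarity adjacency of equations (1)-(2) follows, assuming x is the T x 2D video feature matrix from S1; the value delta = 0.75 comes from the text.

```python
# Sketch of equations (1)-(2): cosine-similarity adjacency with weak edges dropped.
import torch
import torch.nn.functional as F

def build_adjacency(x: torch.Tensor, delta: float = 0.75) -> torch.Tensor:
    x_norm = F.normalize(x, p=2, dim=-1)          # L2-normalize each frame feature
    adj = x_norm @ x_norm.t()                     # A_ij = cosine similarity of frames i and j
    return torch.where(adj >= delta, adj, torch.zeros_like(adj))  # discard edges below delta
```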
based on matrix A ij Constructing edges of graph, initial feature X (0) X being the first layer of the networkInputting, executing graph convolution operation for transformation, wherein the formula is as follows:
Figure BDA0003593835390000086
wherein l ≧ 1 denotes the number of layers of the graph convolution network,
Figure BDA0003593835390000087
is the output of the last layer of the graph convolution network, W (l-1) Is the weight matrix that needs to be learned,
Figure BDA0003593835390000088
is that
Figure BDA0003593835390000089
The degree of regularization of the laplacian of (c),
Figure BDA00035938353900000810
is that
Figure BDA00035938353900000811
The degree matrix of (a) is,
Figure BDA00035938353900000812
representing a matrix with self-circulation, ReLU is the activation function.
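The following is a minimal sketch of one graph-convolution layer implementing equation (3): self-loops are added, the adjacency is symmetrically normalized, and a learned linear map plus ReLU is applied. The layer width is an illustrative assumption.

```python
# Sketch of equation (3): normalized adjacency with self-loops, linear map, ReLU.
import torch
import torch.nn as nn

class GraphConvLayer(nn.Module):
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.weight = nn.Linear(in_dim, out_dim, bias=False)  # W^(l-1)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        a_hat = adj + torch.eye(adj.size(0), device=adj.device)        # A-hat with self-loops
        d_inv_sqrt = a_hat.sum(dim=1).clamp(min=1e-6).pow(-0.5)        # D-hat^(-1/2)
        norm_adj = d_inv_sqrt.unsqueeze(1) * a_hat * d_inv_sqrt.unsqueeze(0)
        return torch.relu(norm_adj @ self.weight(x))                   # equation (3)

# x_s = GraphConvLayer(2048, 2048)(x, build_adjacency(x))  # spatial features X_S
```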
S3, construct a one-dimensional temporal convolution network and learn the temporal continuity relation between video frames to obtain the temporal features X_T.
Because every video segment has unique forward and backward neighbours in the temporal dimension, the invention uses a one-dimensional temporal convolution network to aggregate the features of adjacent segments and obtain the updated temporal features:

X_T = \Phi_t(X, W_t)    (4)

where \Phi_t denotes the one-dimensional temporal convolution network and W_t are the weight parameters to be learned.
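A minimal sketch of the temporal convolution \Phi_t of equation (4) follows; the kernel size and the use of a single layer are assumptions for illustration.

```python
# Sketch of equation (4): 1-D temporal convolution aggregating neighbouring frames.
import torch
import torch.nn as nn

class TemporalConv(nn.Module):
    def __init__(self, dim: int, kernel_size: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (T, dim); Conv1d expects (batch, channels, time)
        return torch.relu(self.conv(x.t().unsqueeze(0))).squeeze(0).t()

# x_t = TemporalConv(2048)(x)  # temporal features X_T, shape (T, 2048)
```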
S4, fuse the spatial features and the temporal features with a fusion technique to obtain more discriminative spatio-temporal correlation fusion features X_{Fusion}.
For convenience of expression, the invention uses the following function to represent the generation of the spatio-temporal correlation fusion features:

X_{Fusion} = f_{Fusion}(X, W)    (5)

where X_{Fusion} \in \mathbb{R}^{T \times E}, E is the dimension of the spatio-temporal correlation feature, and W = \{W^{(l-1)}, W_t\} is the parameter set of the graph convolution network and the one-dimensional convolution network.
S5, generate the spatio-temporally correlated action pooling features X_{act} and background pooling features X_{bkg} of the video through the action-background joint attention mechanism.
The action-background joint attention mechanism consists of two 1-D convolution networks followed by a softmax layer, and its output is a T \times 2 attention matrix:

A_{att} = \Phi_{att}(X, W_{att})    (6)

where A_{att} \in \mathbb{R}^{T \times 2}, \Phi_{att} is a filtering mechanism composed of several convolution layers, and W_{att} are the parameters to be learned. To generate attention for actions and background, the softmax activation function is applied along the second dimension of the T \times 2 matrix; with i \in \{act, bkg\}, the formula is:

a_i^t = \frac{\exp(A_{att}^{t,i})}{\sum_{j \in \{act,bkg\}} \exp(A_{att}^{t,j})}    (7)

where a_i^t denotes the probability that the t-th frame of the video is action or background, and a_{act} and a_{bkg} are the first and second columns of the action-background attention map, respectively.
Next, using the corresponding attention weights, the action-pooled and background-pooled features are generated separately:

X_i = a_i \otimes X    (8)

where a_i \in \mathbb{R}^{T \times 1} is the action or background attention and \otimes denotes element-wise multiplication along the temporal dimension.
The invention uses the following function to represent the correlated features of action or background pooling:

X_{i\_Fusion} = f_{Fusion}(X_i, W)    (9).
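A minimal sketch of the action-background joint attention of equations (6)-(8) follows: two 1-D convolutions produce a T x 2 map, softmax over the two channels gives per-frame action/background probabilities, and each attention column re-weights the original features. The hidden width and kernel sizes are illustrative assumptions.

```python
# Sketch of equations (6)-(8): two-channel attention and weighted pooling of features.
import torch
import torch.nn as nn

class ActionBackgroundAttention(nn.Module):
    def __init__(self, dim: int, hidden: int = 512):
        super().__init__()
        self.att = nn.Sequential(
            nn.Conv1d(dim, hidden, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(hidden, 2, kernel_size=1),               # T x 2 attention logits
        )

    def forward(self, x: torch.Tensor):
        logits = self.att(x.t().unsqueeze(0)).squeeze(0).t()   # (T, 2)
        attn = torch.softmax(logits, dim=1)                    # equation (7)
        a_act, a_bkg = attn[:, 0:1], attn[:, 1:2]
        return a_act * x, a_bkg * x, attn                      # X_act, X_bkg (equation (8))
```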
s6, constructing a three-branch classification network, and predicting a spatio-temporal correlated class activation sequence A of the action and the background in the video by using the basic branches in the three-branch network as shown in figure 3 base (ii) a Method for predicting action class activation sequence A in video by using two opposite pooling branches of training targets act Or background class activation sequence A bkg
As shown in FIG. 3, the spatiotemporal associations are fused to a feature X Fusion Sending the data into a classifier consisting of full connection layers, wherein the expression is as follows:
Α base =f cls (X Fusion ,W cls ) (10)
wherein the content of the first and second substances,
Figure BDA0003593835390000101
c +1 represents C action classes plus a background class, f cls Is a classifier,W cls Is a parameter to be learned;
secondly, aggregating class activation scores of the frame level along the time dimension by adopting a top-k averaging method, wherein the formula is as follows:
Figure BDA0003593835390000102
wherein a is ∈ A and is A base The values in the matrix are then compared to each other,
Figure BDA0003593835390000103
m-8 is a hyper-parameter that controls the number of selected frames, and then along the category dimension, the probability that the video belongs to each category is calculated using the softmax activation function, as follows:
Figure BDA0003593835390000104
wherein the content of the first and second substances,
Figure BDA0003593835390000105
c' represents an action category;
and finally, comparing the binary cross entropy loss function with the Ground Truth to obtain a classification loss formula of the space-time association fusion characteristics:
Figure BDA0003593835390000106
wherein, N is the number of videos,
Figure BDA0003593835390000107
is a regularized video-level label,
Figure BDA0003593835390000108
and is provided with
Figure BDA0003593835390000109
Is a video v n The tag vector of (a);
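A minimal sketch of the base-branch scoring and loss of equations (10)-(13) follows, written for a single video; the classifier dimensions E and C are placeholders, and m = 8 follows the text.

```python
# Sketch of equations (10)-(13): frame-level classifier, top-k mean pooling over time,
# softmax over C+1 classes, and cross-entropy against the normalized video-level label.
import math
import torch
import torch.nn as nn

def video_level_loss(cas: torch.Tensor, y: torch.Tensor, m: int = 8) -> torch.Tensor:
    """cas: (T, C+1) class activation sequence; y: (C+1,) multi-hot video label."""
    T = cas.size(0)
    k = max(1, math.ceil(T / m))
    topk = cas.topk(k, dim=0).values.mean(dim=0)        # equation (11): top-k average per class
    p = torch.softmax(topk, dim=0)                      # equation (12)
    y_norm = y / y.sum().clamp(min=1e-6)                # normalized video-level label
    return -(y_norm * torch.log(p + 1e-6)).sum()        # equation (13), one video

# classifier = nn.Linear(E, C + 1)                      # f_cls with shared weights W_cls
# loss_base = video_level_loss(classifier(x_fusion), y)
```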
also, associated features X that are pooled for action or context i_Fusion Respectively sending the three-branch classification network into an action pooling branch and a background pooling branch to obtain class activation sequences in the branches, wherein the formula is as follows:
Α i =f cls (X i_Fusion ,W cls ) (14)
wherein, f cls Is a classifier, W cls Is a shared parameter that needs to be learned;
then, obtaining the class activation score of the video level, and executing softmax to generate class-by-class activation probability; also using a binary cross-entropy function as the classification penalty, the expression is as follows:
Figure BDA0003593835390000111
wherein, N is the number of videos,
Figure BDA0003593835390000112
is a regularized video level label in the action or background pooling branch,
Figure BDA0003593835390000113
is a video v n Tags in action pooling branches;
Figure BDA0003593835390000114
Figure BDA0003593835390000115
is a video v n Tags in the background pooling branch.
S7, train the network model by combining the cross-entropy classification loss functions of the three-branch classification network and calculate the total loss value:

L_{cls} = L_{base} + L_{act} + L_{bkg}    (16)
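A minimal sketch of one training step for equation (16) follows, assuming the three branch losses have been computed with the shared classifier on X_Fusion, X_act_Fusion and X_bkg_Fusion; the optimizer choice and learning rate are illustrative assumptions.

```python
# Sketch of equation (16): sum the three branch losses and take one optimizer step.
import torch

# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # hypothetical setup
def training_step(optimizer, loss_base, loss_act, loss_bkg):
    loss_cls = loss_base + loss_act + loss_bkg   # equation (16)
    optimizer.zero_grad()
    loss_cls.backward()
    optimizer.step()
    return loss_cls.item()
```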
s8, using the trained model for action detection, the invention uses the class activation sequence A generated by the action pooling branch in the model act For final action positioning and classification;
setting a classification threshold θ cls Leave the action category with classification score greater than the threshold for action localization 0.25;
setting a positioning threshold θ act To enrich the prediction proposal, θ act Set to multiple threshold position of [0, 0.25%]Step size is 0.025, and continuous frames with class activation scores larger than or equal to a threshold value are combined together to form a candidate action proposal;
and deleting the action prediction with higher overlapping degree in the candidate proposal by adopting a non-maximum suppression method (NMS) to obtain a final action detection result.
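A minimal sketch of the proposal generation and NMS of step S8 follows: frames whose activation for a kept class exceeds a localization threshold are grouped into consecutive runs, each run becomes a proposal scored here (for illustration) by its mean activation, and overlapping proposals are suppressed. The proposal scoring rule and the NMS IoU threshold are assumptions; the threshold values follow the text.

```python
# Sketch of step S8: thresholding into consecutive runs, then temporal NMS.
import numpy as np

def generate_proposals(scores: np.ndarray, theta_act: float):
    """scores: (T,) activation of one kept class; returns [(start, end, score), ...]."""
    keep = scores >= theta_act
    proposals, start = [], None
    for t, flag in enumerate(keep):
        if flag and start is None:
            start = t
        elif not flag and start is not None:
            proposals.append((start, t - 1, float(scores[start:t].mean())))
            start = None
    if start is not None:
        proposals.append((start, len(scores) - 1, float(scores[start:].mean())))
    return proposals

def nms(proposals, iou_thresh: float = 0.5):
    """Keep highest-scoring proposals, dropping those with temporal IoU above the threshold."""
    proposals = sorted(proposals, key=lambda p: p[2], reverse=True)
    kept = []
    for s, e, sc in proposals:
        suppressed = False
        for ks, ke, _ in kept:
            inter = max(0, min(e, ke) - max(s, ks) + 1)
            union = (e - s + 1) + (ke - ks + 1) - inter
            if inter / union > iou_thresh:
                suppressed = True
                break
        if not suppressed:
            kept.append((s, e, sc))
    return kept

# candidates = [p for th in np.arange(0.0, 0.25, 0.025) for p in generate_proposals(act_scores, th)]
# detections = nms(candidates)
```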
In light of the foregoing description of the preferred embodiment of the present invention, many modifications and variations will be apparent to those skilled in the art without departing from the spirit and scope of the invention. The technical scope of the present invention is not limited to the content of the specification, and must be determined according to the scope of the claims.

Claims (10)

1. A weakly supervised temporal action detection method based on spatio-temporal correlation learning, characterized by comprising the following steps:
S1, inputting a video frame sequence and extracting features of the video frames through an I3D network;
S2, constructing a dynamic spatial graph network structure for the video and learning the spatial similarity relation between video frames to obtain the video spatial features;
S3, constructing a one-dimensional temporal convolution network and learning the temporal continuity relation between video frames to obtain the video temporal features;
S4, fusing the temporal features and the spatial features to obtain more discriminative spatio-temporal correlation fusion features;
S5, generating spatio-temporally correlated action pooling features and background pooling features using an action-background attention mechanism;
S6, constructing a three-branch classification network, predicting the spatio-temporally correlated class activation sequence of actions and background in the video with the base branch, predicting the action class activation sequence and the background class activation sequence with two pooling branches whose training objectives are opposite, applying Top-k averaging to the three class activation sequences to obtain three video-level class activation scores, and finally obtaining three classification loss functions with cross entropy;
S7, training the network model by combining the cross-entropy classification loss functions of the three-branch classification network and calculating the total loss value;
S8, using the trained network model for action detection: the class activation sequence generated by the action pooling branch of the network model is used for the final action localization and classification; a classification threshold is set for action localization, a localization threshold is set, consecutive frames whose class activation scores are greater than or equal to the localization threshold are merged to form candidate action proposals, and action predictions with high overlap among the candidate proposals are removed by non-maximum suppression to obtain the action detection result.
2. The weakly supervised temporal action detection method based on spatio-temporal correlation learning of claim 1, wherein step S2 includes: taking each frame sampled from the video as a node of the graph network structure and the similarity between nodes as the weight of an edge; measuring the relation between frames with a cosine similarity function to obtain the adjacency matrix of the spatial graph, setting a threshold, and discarding edges with weights smaller than the threshold;
the adjacency matrix of the spatial graph is expressed as follows:

A_{ij} = \frac{x_i^{\top} x_j}{\lVert x_i \rVert \, \lVert x_j \rVert}    (1)

where x_i, x_j are the features of video frames i and j, i, j = 1, 2, ..., T;
a threshold \delta is set to discard edges with weights smaller than \delta, and the adjacency matrix of the spatial graph after discarding is:

A_{ij} = \begin{cases} A_{ij}, & A_{ij} \geq \delta \\ 0, & A_{ij} < \delta \end{cases}    (2)

the edges of the graph are constructed from the matrix A_{ij}, the initial feature X^{(0)} = X is the input of the first layer, and the transformation performed by the graph convolution operation is:

X^{(l)} = \mathrm{ReLU}\left( \hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}} X^{(l-1)} W^{(l-1)} \right)    (3)

where l \geq 1 denotes the layer index of the graph convolution network, X^{(l-1)} is the output of the previous layer, W^{(l-1)} is the weight matrix to be learned, \hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}} is the normalized Laplacian of \hat{A}, \hat{D} is the degree matrix of \hat{A}, \hat{A} denotes the adjacency matrix with self-loops, and ReLU is the activation function.
3. The weakly supervised temporal action detection method based on spatio-temporal correlation learning of claim 1, wherein step S3 includes: aggregating the features of adjacent segments with a one-dimensional temporal convolution network to obtain the updated temporal features:

X_T = \Phi_t(X, W_t)    (4)

where \Phi_t denotes the one-dimensional temporal convolution network and W_t are the weight parameters to be learned.
4. The weakly supervised temporal action detection method based on spatio-temporal correlation learning of claim 1, wherein the formula for fusing the temporal features and the spatial features is:

X_{Fusion} = f_{Fusion}(X, W)    (5)

where X_{Fusion} \in \mathbb{R}^{T \times E}, E is the dimension of the spatio-temporal correlation feature, and W = \{W^{(l-1)}, W_t\} is the parameter set of the graph convolution network and the one-dimensional convolution network.
5. The weakly supervised temporal action detection method based on spatio-temporal correlation learning of claim 1, wherein the action-background attention mechanism consists of two 1-D convolution networks followed by a softmax layer and outputs a T \times 2 attention matrix, expressed as:

A_{att} = \Phi_{att}(X, W_{att})    (6)

where A_{att} \in \mathbb{R}^{T \times 2}, \Phi_{att} is a filtering mechanism composed of several convolution layers, and W_{att} are the parameters to be learned;
the softmax activation function is applied along the second dimension of the T \times 2 matrix, yielding the probability a_i^t of action or background:

a_i^t = \frac{\exp(A_{att}^{t,i})}{\sum_{j \in \{act,bkg\}} \exp(A_{att}^{t,j})}    (7)

where i \in \{act, bkg\}, a_i^t denotes the probability that the t-th frame of the video is action or background, and a_{act} and a_{bkg} are the first and second columns of the action and background attention maps, respectively;
the action-pooled and background-pooled features are generated separately as:

X_i = a_i \otimes X    (8)

where a_i \in \mathbb{R}^{T \times 1} is the action or background attention and \otimes denotes element-wise multiplication along the temporal dimension;
the correlated features of action or background pooling are represented by formula (9):

X_{i\_Fusion} = f_{Fusion}(X_i, W)    (9).
6. The weakly supervised temporal action detection method based on spatio-temporal correlation learning of claim 1, wherein step S6 includes: feeding the spatio-temporal correlation fusion features into a classifier composed of fully connected layers, aggregating the frame-level class activation scores along the temporal dimension by top-k averaging, and comparing a binary cross-entropy loss function with the Ground Truth to obtain the classification loss of the spatio-temporal correlation fusion features.
7. The weakly supervised temporal action detection method based on spatio-temporal correlation learning of claim 6, wherein the formula for feeding the spatio-temporal correlation fusion features into the classifier composed of fully connected layers is:

A_{base} = f_{cls}(X_{Fusion}, W_{cls})    (10)

where A_{base} \in \mathbb{R}^{T \times (C+1)}, C denotes the number of action classes, C+1 denotes the C action classes plus one background class, f_{cls} is the classifier, and W_{cls} are the parameters to be learned;
the formula of the top-k averaging method for aggregating frame-level class activation scores along the temporal dimension is:

s^c = \max_{\Omega^c \subset A_{base}[:,c],\, |\Omega^c| = k} \frac{1}{k} \sum_{a \in \Omega^c} a, \qquad k = \lceil T/m \rceil    (11)

where a \in A_{base} are the values in the matrix and m is used to control the number of selected frames;
along the class dimension, the probability that the video belongs to each class is computed with the softmax activation function:

p^c = \frac{\exp(s^c)}{\sum_{c'=1}^{C+1} \exp(s^{c'})}    (12)

where c' denotes an action class;
the formula of the classification loss of the spatio-temporal correlation fusion features is:

L_{base} = -\frac{1}{N} \sum_{n=1}^{N} \sum_{c=1}^{C+1} \hat{y}_n^c \log p_n^c    (13)

where N is the number of videos, \hat{y}_n = y_n / \sum_{c=1}^{C+1} y_n^c is the normalized video-level label, and y_n is the label vector of video v_n.
8. The weakly supervised temporal action detection method based on spatio-temporal correlation learning of claim 1, wherein step S6 further includes: feeding the action-pooled and background-pooled spatio-temporal features into the action pooling branch and the background pooling branch, respectively, to obtain the action and background pooled class activation sequences; aggregating the frame-level class activation scores along the temporal dimension by top-k averaging; and comparing a binary cross-entropy loss function with the Ground Truth to obtain the classification losses in the action and background pooling branches.
9. The weakly supervised temporal action detection method based on spatio-temporal correlation learning of claim 8, wherein the formula of the action and background pooled class activation sequences is:

A_i = f_{cls}(X_{i\_Fusion}, W_{cls})    (14)

where f_{cls} is the classifier and W_{cls} are the shared parameters to be learned;
the formula for the classification losses in the action and background pooling branches is:

L_i = -\frac{1}{N} \sum_{n=1}^{N} \sum_{c=1}^{C+1} \hat{y}_{i;n}^c \log p_{i;n}^c    (15)

where i = \{act, bkg\}, N is the number of videos, \hat{y}_{i;n} is the normalized video-level label in the action or background pooling branch, \hat{y}_{act;n} is the label of video v_n in the action pooling branch, and \hat{y}_{bkg;n} is the label of video v_n in the background pooling branch.
10. The weakly supervised temporal action detection method based on spatio-temporal correlation learning of claim 1, wherein the total loss value is calculated as:

L_{cls} = L_{base} + L_{act} + L_{bkg}    (16)

where L_{base}, L_{act} and L_{bkg} are the classification loss of the base branch, the classification loss of the action pooling branch, and the classification loss of the background pooling branch, respectively.
CN202210383307.9A 2022-04-13 2022-04-13 Weakly supervised temporal action detection method based on spatio-temporal correlation learning Withdrawn CN114821772A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210383307.9A CN114821772A (en) Weakly supervised temporal action detection method based on spatio-temporal correlation learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210383307.9A CN114821772A (en) Weakly supervised temporal action detection method based on spatio-temporal correlation learning

Publications (1)

Publication Number Publication Date
CN114821772A true CN114821772A (en) 2022-07-29

Family

ID=82534624

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210383307.9A Withdrawn CN114821772A (en) 2022-04-13 2022-04-13 Weak supervision time sequence action detection method based on time-space correlation learning

Country Status (1)

Country Link
CN (1) CN114821772A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116030538A (en) * 2023-03-30 2023-04-28 中国科学技术大学 Weak supervision action detection method, system, equipment and storage medium



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20220729