CN113420592B - Weakly supervised video behavior localization method based on a proxy metric model - Google Patents

Weakly supervised video behavior localization method based on a proxy metric model

Info

Publication number
CN113420592B
Authority
CN
China
Prior art keywords
action
video
segment
vector
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110527929.XA
Other languages
Chinese (zh)
Other versions
CN113420592A (en)
Inventor
张宇
米思娅
陈子涵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN202110527929.XA priority Critical patent/CN113420592B/en
Publication of CN113420592A publication Critical patent/CN113420592A/en
Application granted granted Critical
Publication of CN113420592B publication Critical patent/CN113420592B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a weakly supervised video behavior localization method based on a proxy metric model. Behavior localization plays an important role in the field of behavior recognition, and because manual annotation of temporal action intervals is expensive and time-consuming, an effective weakly supervised video behavior localization method is indispensable. The invention provides a proxy metric module that clusters the features of segments of the same action together and pushes the features of background segments in untrimmed videos away from action-segment features, which effectively improves the accuracy of video behavior localization in a weakly supervised setting.

Description

Weakly supervised video behavior localization method based on a proxy metric model
Technical Field
The application relates to the technical field of computer vision, and in particular to a weakly supervised video behavior localization method based on a proxy metric model.
Background
Video action localization refers to training an artificial intelligence model to detect the time intervals in which actions occur in a video and to identify the action categories. It is widely used in many fields such as intelligent surveillance, action retrieval, human-computer interaction and virtual reality. Traditional video action localization is fully supervised: the labels of the training set include not only the classification label of each action in the video but also the start and end times of each action. However, as the number of videos generated every day in the real world grows, manual frame-level annotation of video action intervals is both expensive and time-consuming, and the accuracy of manually annotated times is difficult to guarantee. Therefore, weakly supervised video action localization, which requires no temporal action labels, plays an essential role in the field of video behavior recognition.
Video data in the field of video action localization contain a large number of action-irrelevant segments, commonly referred to as background segments. Weakly supervised video action localization only requires the data set to carry video-level action classification labels; annotation of the start and end times of actions is not required. In this setting, the key to successfully training a video action localization model is that segments of the same action share similar features. If a segment has low similarity to the rest of the training set, it can with high probability be regarded as a background segment. By means of this segment similarity, the occurrence time of actions can be predicted during model training. The study of weakly supervised video action localization is therefore feasible.
Existing weakly supervised video action localization methods mainly work with attention mechanisms, multiple-instance learning and similarity metrics. However, these methods always incorporate all segments of a video into the model training process; video segments with a low probability of containing actions are not excluded from training in advance. In addition, existing methods do not combine the common feature dimensions of the same action class into a representative action feature vector, so the accuracy of action clustering and background separation during model training is difficult to improve. A weakly supervised video action localization method based on a proxy metric model is therefore urgently needed.
Disclosure of Invention
Purpose of the invention: In order to solve the problems in the prior art and to localize and recognize actions by training a neural network model when the video data set carries no temporal action labels, the invention provides a weakly supervised video behavior localization method based on a proxy metric model, which relies on action proxy vectors to perform action clustering and background separation effectively, thereby improving the accuracy of weakly supervised action localization.
Technical scheme: A weakly supervised video behavior localization method based on a proxy metric model comprises the following steps:
Step 1: Separate and extract the feature vectors of the training-set videos. The untrimmed video V is regarded as a collection of segments, each containing an equal number of frames, so that the k-th video V_k is represented as a sequence of n segments, where k denotes the index of the video.
Step 2: After feature extraction, compute their uniform embedded form: for a given video V_k, the feature vector of each segment is fed into a module consisting of a fully connected layer, a ReLU activation layer and a Dropout layer, producing an embedded feature vector, which can be expressed as:
X_emb = f_embed(X, Θ_embed)   (1)
where f_embed denotes the embedding network and Θ_embed denotes its parameters.
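By way of illustration, a minimal PyTorch sketch of such an embedding module is given below; the feature dimension and dropout rate are assumed values, not ones fixed by the disclosure.

```python
import torch
import torch.nn as nn

class EmbeddingModule(nn.Module):
    """Embedding network f_embed of formula (1): FC -> ReLU -> Dropout.

    Input:  X of shape (n_segments, feat_dim), the per-segment features from Step 1.
    Output: X_emb of shape (n_segments, emb_dim).
    Dimensions and dropout rate below are illustrative assumptions.
    """
    def __init__(self, feat_dim: int = 2048, emb_dim: int = 2048, p_drop: float = 0.5):
        super().__init__()
        self.fc = nn.Linear(feat_dim, emb_dim)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(p_drop)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.dropout(self.relu(self.fc(x)))
```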
Step 3: Apply a Class Activation Sequence (CAS) module to calculate the segment-level action classification score of each embedded feature vector in the video, expressed as:
π = g_cas(X_emb, Θ_cas)   (2)
where g_cas denotes the CAS linear classifier with parameters Θ_cas.
Step 4: Next, calculate the action classification score of the whole video (the CAS classification score), using a softmax function to compute the likelihood that the video contains each action class, expressed as:
softmax(p)_c = exp(p_c) / Σ_{c'=1..C} exp(p_{c'})   (3)
where C denotes the number of action classes and p_c denotes the mean of the K_act highest segment classification scores of action class c. K_act is a hyper-parameter representing the number of pseudo-action segments.
Step 5: So that only the most representative parts of the video participate in training the neural network model, pseudo-action segments and pseudo-background segments are screened from the video. For the n-th segment of the t-th video V_t, calculate the likelihood that it contains an action to be localized, expressed as the action likelihood score:
p(V_{t,n}) = dropout(||X_{t,n}||)   (4)
where X_{t,n} denotes the embedded feature vector of the n-th segment of the t-th video computed in Step 2 and ||·|| denotes the L2 norm. The K_act segments with the highest action likelihood scores in the video are taken as pseudo-action segments, denoted A_act; conversely, the K_bkg segments with the lowest action likelihood scores are taken as pseudo-background segments, denoted A_bkg.
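A short sketch of this selection step, assuming the per-video embedded features from Step 2 are available as a tensor; the use of topk over the dropout-perturbed L2 norms follows formula (4), while the concrete values of K_act and K_bkg are left as arguments.

```python
import torch
import torch.nn.functional as F

def select_pseudo_segments(x_emb: torch.Tensor, k_act: int, k_bkg: int,
                           p_drop: float = 0.5, training: bool = True):
    """Formula (4): action likelihood score = dropout(||X_{t,n}||_2).

    x_emb: (n_segments, emb_dim) embedded features of one video (Step 2).
    Returns indices of the pseudo-action segments A_act (k_act highest scores)
    and the pseudo-background segments A_bkg (k_bkg lowest scores).
    """
    scores = x_emb.norm(p=2, dim=1)                              # L2 norm per segment
    scores = F.dropout(scores, p=p_drop, training=training)      # dropout on the scores
    a_act = torch.topk(scores, k_act, largest=True).indices      # pseudo-action segments
    a_bkg = torch.topk(scores, k_bkg, largest=False).indices     # pseudo-background segments
    return a_act, a_bkg
```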
Step 6: Next, train a proxy vector P_c for each action class; the proxy vectors P_1, P_2, …, P_C form the action proxy map P. Among the pseudo-action segments of a video, the K_topa segments with the highest CAS classification scores of the corresponding action are selected for training. If a video V contains action c in its set of action labels, its action center vector for action c is calculated by formula (5) as the average of the embedded feature vectors of the selected segments, weighted by their CAS classification scores for action c, where π^c_t denotes the CAS classification score of action c at the t-th segment and X_{t,i} denotes the embedded feature vector of segment i. Because the same action has different manifestations in different videos, the action proxy vector is a synthesis of the action features over the whole data set; it is calculated by formula (6) as the average of the action center vectors over the n_c videos labeled with action c. The proxy vector can be representative of the original action only when enough videos participate in its training. Therefore, a parameter ε is set, and P is considered valid when the number of videos used to train the proxy vector P exceeds ε.
Step 7: Perform action clustering and background classification on the segments of the video according to the action proxy vectors calculated in Step 6. For a segment X in a training video, if X belongs to a pseudo-action segment, the action clustering loss L_act is applied for clustering; if X belongs to a pseudo-background segment, skip to Step 8. If an action class k is in the label set of X and the classification score of X for k is the maximum over all action classes, the proxy vector of that class belongs to the positive proxy set P^+_X of X, while the proxies of the remaining actions belong to the negative proxy set P^-_X. The action clustering loss L_act (formula (7)) is computed from c_{x,k}, the classification score of action k on segment X, and S(·,·), the cosine similarity. Through this loss, the features of X are drawn toward the positive proxy set P^+_X and pushed away from the negative proxy set P^-_X, thereby clustering features of the same action.
Step 8: If a segment X in a training video belongs to a pseudo-background segment, then, to distinguish background from action effectively, the background classification loss L_bkg (formula (8)) is applied, which keeps the features of X away from all action proxy vectors P; that is, all action proxy vectors form the negative proxy set P^-_X. The terms c_{x,k} and S(·,·) have the same meaning as in Step 7. This loss keeps the features of X far from the representative features of all actions to be detected, reducing the chance that background segments in the video are misjudged as action segments.
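Because the exact expressions of formulas (7) and (8) are not reproduced in this text, the following is only a plausible contrastive sketch of the stated intent: pull pseudo-action features toward their positive proxy set and push pseudo-background features away from all proxies, using the cosine similarity S(·,·) and the classification scores c_{x,k} as weights. The temperature tau and the precise weighting are assumptions, not the patent's exact losses.

```python
import torch
import torch.nn.functional as F

def action_clustering_loss(x, proxies, pos_mask, cls_scores, tau: float = 0.1):
    """Hypothetical stand-in for L_act (formula (7)): a softmax-style term over cosine
    similarities that pulls the pseudo-action feature x toward proxies in its positive
    set and away from the rest; the classification scores weight the positive terms.

    x:          (emb_dim,) embedded feature of one pseudo-action segment
    proxies:    (C, emb_dim) action proxy vectors P_1..P_C
    pos_mask:   (C,) bool, True for proxies in the positive set of x
    cls_scores: (C,) segment classification scores c_{x,k}
    """
    sim = F.cosine_similarity(x.unsqueeze(0), proxies, dim=1) / tau   # S(x, P_k) / tau
    log_p = F.log_softmax(sim, dim=0)
    w = cls_scores * pos_mask.float()
    return -(w * log_p).sum() / w.sum().clamp(min=1e-6)

def background_classification_loss(x, proxies, tau: float = 0.1):
    """Hypothetical stand-in for L_bkg (formula (8)): penalize high similarity between
    the pseudo-background feature x and every action proxy (all proxies are negative)."""
    sim = F.cosine_similarity(x.unsqueeze(0), proxies, dim=1) / tau
    return torch.logsumexp(sim, dim=0) - torch.log(torch.tensor(float(proxies.shape[0])))
```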
Step 9: Because the training of the action proxy vector P_c relies on the segment-level action classification scores of Step 3, and the accuracy of the classification module is low early in model training, a background modeling loss is applied first so that the classification module converges quickly; the training of the action proxy vectors P_c follows. In the background modeling loss (formulas (9) and (10)), m denotes a predefined maximum feature magnitude, C denotes the number of action classes in the data set, and S_{c,j} denotes the softmax classification score of action c in segment j.
Step 10: Set a multi-label action classification loss as the cross-entropy loss between the prediction score p_i and the video-level action label y_i (formula (11)). The total loss function of the proxy metric model is given by formula (12), where N denotes the number of training videos in a training batch.
Step 11: Apply the action proxy map P trained in Step 6 to the test stage. In the test stage, the possible actions contained in the target video are represented by the classification score
S_{t,c} = π_{t,c} · S(X_t, P_c)   (13)
where π_{t,c} denotes the segment classification score of action class c in the t-th segment and S(X_t, P_c) denotes the cosine similarity between the embedded feature vector of the t-th segment and the proxy vector P_c of action class c.
Step 12: Apply the proxy metric model of Step 10 to an action localization data set, train a neural network model using the video-level action classification labels of the data set, and then test the localization accuracy with the trained neural network model.
Further, the methods for extracting video features in Step 1 include the C3D, I3D and TSN models.
Further, the number K_act of pseudo-action segments in Step 5 is set according to the average number of action segments appearing in the videos of the actual data set.
Further, the number K_bkg of pseudo-background segments in Step 5 is set to twice the number K_act of pseudo-action segments.
Further, a training segment may participate in the action proxy feature vector of Step 6 only if: 1. the video to which the segment belongs contains a label of the corresponding action; and 2. the segment belongs to the pseudo-action segments of that video.
Further, the higher the parameter ε in Step 6 is set, the higher the precision of the action proxy vector P. When the value of ε exceeds a certain threshold, the training of P converges. The setting of ε is adjusted according to the actual training process.
Further, in Step 10 the values of the parameters α and β represent the degree to which the proxy-metric losses of Step 7 and Step 8 participate in model training; they are set low before the action proxy map P has been trained and high afterwards.
Further, in Step 11 the segment classification score π may be multiplied one or more times by the cosine similarity S(X, P) between the embedded feature vector and the proxy vector of the action class.
Beneficial effects: The invention provides a proxy metric model method for weakly supervised video action localization that achieves higher action localization accuracy than existing methods.
Drawings
FIG. 1 is a flow chart of a method of the present invention;
FIG. 2 is a TensorBoardX visualization of the feature distribution of the proxy vectors of the 20 action classes of the THUMOS14 data set trained by the present invention;
FIG. 3 is a diagram comparing the action localization result of the present invention on an untrimmed golf video with the ground truth of the video;
FIG. 4 is a diagram comparing the action localization result of the present invention on an untrimmed beach volleyball video with the ground truth of the video.
Detailed Description
The invention is described in further detail below with reference to the detailed description and the accompanying drawings:
the embodiment provides a weakly supervised video behavior localization method based on a proxy metric model, and applies the method to the action localization effect detection of the THUMOS14 uncut data set.
The flow of the method is shown in figure 1:
Step 1: Separate and extract the feature vectors of the training-set videos. An untrimmed video V can be regarded as a collection of segments, each containing an equal number of frames, so that the k-th video V_k is divided into a sequence of n segments, where k denotes the index of the video. The methods used to extract video features include the C3D, I3D and TSN models.
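A minimal sketch of this segmentation, assuming frame-level features have already been extracted by a backbone such as I3D; the segment length of 16 frames and the mean pooling inside a segment are illustrative assumptions.

```python
import torch

def split_into_segments(frame_feats: torch.Tensor, frames_per_seg: int = 16) -> torch.Tensor:
    """Group frame-level features into non-overlapping segments with an equal number of
    frames and average the frame features inside each segment (Step 1).

    frame_feats: (n_frames, feat_dim) features from a backbone such as I3D / C3D / TSN.
    Returns:     (n_segments, feat_dim); trailing frames that do not fill a whole
                 segment are dropped.
    """
    n_frames, feat_dim = frame_feats.shape
    n_segments = n_frames // frames_per_seg
    trimmed = frame_feats[: n_segments * frames_per_seg]
    return trimmed.reshape(n_segments, frames_per_seg, feat_dim).mean(dim=1)
```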
Step 2: After feature extraction, compute their uniform embedded form. For a given video V_k, the feature vector of each segment is fed into a module consisting of a fully connected layer, a ReLU activation layer and a Dropout layer. The embedded feature vector can be expressed as:
X_emb = f_embed(X, Θ_embed)   (1)
where f_embed denotes the embedding network and Θ_embed denotes its parameters.
Step 3: Apply a Class Activation Sequence (CAS) module to calculate the segment-level action classification score of each embedded feature vector in the video:
π = g_cas(X_emb, Θ_cas)   (2)
where g_cas denotes the CAS linear classifier with parameters Θ_cas.
Step 4: Next, calculate the action classification score of the entire video, using a softmax function to compute the likelihood that the video contains each action class:
softmax(p)_c = exp(p_c) / Σ_{c'=1..C} exp(p_{c'})   (3)
where C denotes the number of action classes and p_c denotes the mean of the K_act highest segment classification scores of action class c. K_act is a hyper-parameter representing the number of pseudo-action segments; its setting is discussed in Step 5.
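A short sketch of this video-level scoring, following formula (3) as reconstructed above (top-K_act mean per class followed by a softmax over classes):

```python
import torch

def video_level_scores(cas: torch.Tensor, k_act: int) -> torch.Tensor:
    """Video-level action classification of Step 4: p_c is the mean of the k_act highest
    segment scores of class c, then a softmax is taken over the classes.

    cas: (n_segments, C) segment-level CAS scores from Step 3.
    Returns: (C,) likelihood that the video contains each action class.
    """
    topk = torch.topk(cas, k_act, dim=0).values   # (k_act, C) highest scores per class
    p = topk.mean(dim=0)                          # (C,) per-class means
    return torch.softmax(p, dim=0)
```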
Step 5: So that only the most representative parts of the video participate in training the neural network model, pseudo-action segments and pseudo-background segments are screened from the video. For the n-th segment of the t-th video V_t, calculate the likelihood that it contains an action to be localized, expressed as the action likelihood score:
p(V_{t,n}) = dropout(||X_{t,n}||)   (4)
where X_{t,n} denotes the embedded feature vector of the n-th segment of the t-th video computed in Step 2 and ||·|| denotes the L2 norm. The K_act segments with the highest action likelihood scores in the video are taken as pseudo-action segments, denoted A_act; the K_bkg segments with the lowest action likelihood scores are taken as pseudo-background segments, denoted A_bkg.
The number K_act of pseudo-action segments is set according to the average number of action segments appearing in the videos of the actual data set, and the number K_bkg of pseudo-background segments is set to twice K_act.
Step 6: Train a proxy vector P_c for each action class; the proxy vectors P_1, P_2, …, P_C form the action proxy map P. Among the pseudo-action segments of a video, the K_topa segments with the highest CAS classification scores of the corresponding action are selected for training. If a video V contains action c in its set of action labels, its action center vector for action c is calculated by formula (5) as the average of the embedded feature vectors of the selected segments, weighted by their CAS classification scores for action c, where π^c_t denotes the CAS classification score of action c at the t-th segment and X_{t,i} denotes the embedded feature vector of segment i. Since the same action has different manifestations in different videos, the action proxy vector is a synthesis of the action features over the whole data set; it is calculated by formula (6) as the average of the action center vectors over the n_c videos labeled with action c. The proxy vector can be representative of the original action only when enough videos participate in its training. Therefore, a parameter ε is set, and P is considered valid when the number of videos used to train it exceeds ε; the higher ε is set, the higher the precision of the action proxy vector P. When ε exceeds a certain threshold, the training of P converges, and the setting of ε is adjusted according to the actual training process.
A training segment may participate in the action proxy feature vector only if: 1. the video to which the segment belongs contains a label of the corresponding action; and 2. the segment belongs to the pseudo-action segments of that video.
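One plausible realization of formulas (5) and (6) as described above (a CAS-score-weighted center per video, then an average over the videos carrying the label, with the validity check on ε); the exact formulas are not reproduced in this text, so details such as the clamping are assumptions.

```python
from typing import List, Optional
import torch

def action_center(x_emb: torch.Tensor, cas_c: torch.Tensor, idx_topa: torch.Tensor) -> torch.Tensor:
    """CAS-score-weighted average of the embedded features of the selected top-K_topa
    pseudo-action segments of one video (the action center vector of formula (5), as
    described in words above).

    x_emb:    (n_segments, emb_dim) embedded features of the video
    cas_c:    (n_segments,) CAS classification scores of action c
    idx_topa: indices of the K_topa selected segments
    """
    w = cas_c[idx_topa].clamp(min=0)
    return (w.unsqueeze(1) * x_emb[idx_topa]).sum(dim=0) / w.sum().clamp(min=1e-6)

def action_proxy(centers: List[torch.Tensor], eps: int) -> Optional[torch.Tensor]:
    """Average of the action center vectors over the n_c videos labeled with action c
    (formula (6) as described above); the proxy is treated as valid only when more than
    eps videos have contributed (Step 6)."""
    if len(centers) <= eps:
        return None  # not enough contributing videos yet; proxy not yet valid
    return torch.stack(centers).mean(dim=0)
```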
FIG. 2 shows the TensorBoardX visualization of the feature distribution of the proxy vectors of the 20 action classes trained on the THUMOS14 data set by the method of Step 6. The distribution of the trained action proxy vectors is uniform and representative, so it is feasible to compute the distance between an action segment and the original feature vectors of different actions using the trained action proxy vectors.
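A minimal sketch of how such a projection could be logged with tensorboardX for inspection in the TensorBoard projector; the log directory and tag are placeholders.

```python
import torch
from tensorboardX import SummaryWriter

def log_proxy_embedding(proxies: torch.Tensor, class_names: list, logdir: str = "runs/proxies") -> None:
    """Log the C trained action proxy vectors so that the TensorBoard projector can
    display their feature distribution, as in FIG. 2.

    proxies:     (C, emb_dim) action proxy map P
    class_names: list of C action class names (e.g. the THUMOS14 labels)
    """
    writer = SummaryWriter(logdir)
    writer.add_embedding(proxies, metadata=class_names, tag="action_proxies")
    writer.close()
```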
Step 7: Next, perform action clustering and background classification on the segments of the video with the help of the action proxy vectors calculated in Step 6. For a segment X in a training video, if X belongs to a pseudo-action segment, the action clustering loss L_act is applied for clustering; if X belongs to a pseudo-background segment, skip to Step 8. If an action class k is in the label set of X and the classification score of X for k is the maximum over all action classes, the proxy vector of that class belongs to the positive proxy set P^+_X of X, while the proxies of the remaining actions belong to the negative proxy set P^-_X. The action clustering loss L_act (formula (7)) is computed from c_{x,k}, the classification score of action k on segment X, and S(·,·), the cosine similarity. Through this loss, the features of X are drawn toward the positive proxy set P^+_X and pushed away from the negative proxy set P^-_X, thereby clustering features of the same action.
Step 8: If a segment X in a training video belongs to a pseudo-background segment, then, to distinguish the background from the actions more effectively, the background classification loss L_bkg (formula (8)) is applied, which keeps the features of X away from all action proxy vectors; that is, all action proxy vectors form the negative proxy set P^-_X. The terms c_{x,k} and S(·,·) have the same meaning as in Step 7. This loss keeps the features of X far from the representative features of all actions to be detected, reducing the chance that background segments are misjudged as action segments.
Step 9: Since the training of the proxy vectors relies on the segment-level action classification scores of Step 3, and the classification module has low accuracy early in model training, a background modeling loss is applied so that the classification module converges quickly before the training of the proxy vectors begins. In the background modeling loss (formulas (9) and (10)), m denotes a predefined maximum feature magnitude, C denotes the number of action classes in the data set, and S_{c,j} denotes the softmax classification score of action c in segment j.
Step 10: Set a multi-label action classification loss as the cross-entropy loss between the prediction score p_i and the video-level action label y_i (formula (11)). The total loss function of the proxy metric model is given by formula (12), where N denotes the number of training videos in a training batch.
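Since formulas (11) and (12) are not reproduced in this text, the sketch below shows only one common form of the multi-label cross-entropy and a hypothetical combination of the loss terms with the weights α and β described in the next paragraph; it is not the disclosure's exact total loss.

```python
import torch

def classification_loss(pred: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """One common form of the multi-label video classification cross-entropy of Step 10:
    the multi-hot labels y_i are normalized to a distribution and compared with the
    softmax video-level predictions p_i of Step 4 (an assumption for formula (11)).

    pred:   (N, C) video-level class probabilities
    labels: (N, C) multi-hot video-level action labels
    """
    y = labels.float() / labels.float().sum(dim=1, keepdim=True).clamp(min=1.0)
    return -(y * torch.log(pred + 1e-8)).sum(dim=1).mean()

def total_loss(l_cls: torch.Tensor, l_act: torch.Tensor, l_bkg: torch.Tensor,
               l_bm: torch.Tensor, alpha: float, beta: float) -> torch.Tensor:
    """Hypothetical combination of the loss terms (formula (12) is not reproduced):
    classification loss plus the proxy-metric losses of Steps 7 and 8, weighted by
    alpha and beta, plus the background modeling loss of Step 9."""
    return l_cls + alpha * l_act + beta * l_bkg + l_bm
```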
The values of the parameters α and β control the degree to which the proxy-metric losses of Step 7 and Step 8 participate in model training; they are set low before the training of the action proxy map P is complete and high afterwards.
Step 11: The trained action proxy map P is then applied in the test stage. In the test stage, the possible actions contained in the target video are represented by the classification score
S_{t,c} = π_{t,c} · S(X_t, P_c)   (13)
where π_{t,c} denotes the segment classification score of action class c in the t-th segment and S(X_t, P_c) denotes the cosine similarity between the embedded feature vector of the t-th segment and the proxy vector P_c of action class c. The segment classification score π may be multiplied one or more times by the cosine similarity S(X, P) between the embedded feature vector and the proxy vector of the action class.
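A short sketch of this test-stage fusion, implementing formula (13); the optional repeated multiplication by the cosine similarity is exposed as a power argument, which is an assumption about how that repetition is applied.

```python
import torch
import torch.nn.functional as F

def test_scores(x_emb: torch.Tensor, cas: torch.Tensor, proxies: torch.Tensor,
                power: int = 1) -> torch.Tensor:
    """Test-stage classification score of formula (13):
    S_{t,c} = pi_{t,c} * S(X_t, P_c), with the cosine similarity optionally multiplied
    in more than once via `power`.

    x_emb:   (T, emb_dim) embedded segment features of the test video
    cas:     (T, C) segment classification scores pi_{t,c}
    proxies: (C, emb_dim) trained action proxy vectors
    Returns: (T, C) fused segment scores used for localization.
    """
    sim = F.cosine_similarity(x_emb.unsqueeze(1), proxies.unsqueeze(0), dim=2)  # (T, C)
    return cas * sim.pow(power)
```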
Step 12: The proxy metric model is applied to the THUMOS14 data set; the neural network model is trained using only the video-level action classification labels of the data set, and the trained model is then used to test localization accuracy.
In this example, weakly supervised action localization was tested on the THUMOS14 data set. For IoU thresholds in the range starting at 0.3, the method obtains very competitive results compared with previous weakly supervised action localization methods. FIG. 3 and FIG. 4 show the localization effect of the proposed method on individual untrimmed videos. For the beach volleyball video of FIG. 3, the method yields a relatively accurate localization result for each action interval. Some misjudgments of the golf action in FIG. 4 occur because the person in the video performs the golf motion without actually swinging, an error that is difficult to avoid in current video action localization methods. Overall, the difference between the localization results and the ground truth is very small, which strongly demonstrates the advantage of the proposed proxy metric model in practical applications.

Claims (8)

1. A weakly supervised video behavior localization method based on a proxy metric model, characterized in that the method comprises the following steps:
Step 1: separating and extracting feature vectors of the training-set videos, regarding an untrimmed video V as a collection of segments, wherein each segment contains an equal number of frames, so that the k-th video V_k is represented as a sequence of n segments, where k denotes the index of the video;
Step 2: after feature extraction, calculating a uniform embedded form: for a given video V_k, feeding the feature vector of each segment into a module consisting of a fully connected layer, a ReLU activation layer and a Dropout layer to obtain an embedded feature vector, expressed as:
X_emb = f_embed(X, Θ_embed)   (1)
where f_embed denotes the embedding network and Θ_embed denotes its parameters;
Step 3: applying a Class Activation Sequence (CAS) module to calculate the segment-level action classification score of each embedded feature vector in the video, expressed as:
π = g_cas(X_emb, Θ_cas)   (2)
where g_cas denotes the CAS linear classifier with parameters Θ_cas;
Step 4: calculating the action classification score of the whole video (the CAS classification score), using a softmax function to compute the likelihood that the video contains each action class, expressed as:
softmax(p)_c = exp(p_c) / Σ_{c'=1..C} exp(p_{c'})   (3)
where C denotes the number of action classes and p_c denotes the mean of the K_act highest segment classification scores of action class c; K_act is a hyper-parameter denoting the number of pseudo-action segments;
Step 5: screening pseudo-action segments and pseudo-background segments from the video, and calculating, for the n-th segment of the t-th video V_t, the likelihood that it contains an action to be localized, expressed as the action likelihood score:
p(V_{t,n}) = dropout(||X_{t,n}||)   (4)
where X_{t,n} denotes the embedded feature vector of the n-th segment of the t-th video calculated in Step 2 and ||·|| denotes the L2 norm; taking the K_act segments with the highest action likelihood scores in the video as pseudo-action segments, denoted A_act, and the K_bkg segments with the lowest action likelihood scores as pseudo-background segments, denoted A_bkg;
Step 6: training an action proxy vector P_c for each action class, the proxy vectors {P_1, P_2, …, P_C} forming an action proxy map P; selecting, from the pseudo-action segments of the video, the K_topa segments with the highest CAS classification scores of the corresponding action for training; if a video V contains action c in its set of action labels, its action center vector for action c is calculated by formula (5) as the average of the embedded feature vectors of the selected segments weighted by their CAS classification scores for action c, where π^c_t denotes the CAS classification score of action c at the t-th segment and X_{t,i} denotes the embedded feature vector of segment i; because the same action has different manifestations in different videos, the action proxy vector is a synthesis of the action features over the whole data set and is calculated by formula (6) as the average of the action center vectors over the n_c videos labeled with action c; setting a parameter ε, and considering the action proxy vector P_c valid when the number of videos used to train it exceeds ε;
Step 7: performing action clustering and background classification on the segments of the video according to the action proxy vectors calculated in Step 6: for a segment X in a training video, if X belongs to a pseudo-action segment, applying the action clustering loss L_act for clustering, and if X belongs to a pseudo-background segment, skipping to Step 8; if an action class k is in the label set of X and the classification score of X for k is the maximum over all action classes, the proxy vector of that class belongs to the positive proxy set P^+_X of X, while the proxies of the remaining actions belong to the negative proxy set P^-_X; the action clustering loss L_act is calculated by formula (7) from c_{x,k}, the classification score of action k on segment X, and S(·,·), the cosine similarity;
Step 8: if a segment X in a training video belongs to a pseudo-background segment, setting, in order to distinguish the background from the actions effectively, the background classification loss L_bkg, which keeps the features of X away from all action proxy vectors, i.e. all action proxy vectors form the negative proxy set P^-_X; the background classification loss L_bkg is given by formula (8);
Step 9: because the training of the action proxy vector P_c relies on the segment-level action classification scores of Step 3, while the accuracy of the classification module is low early in model training, applying a background modeling loss so that the classification module converges quickly, and training the action proxy vectors P_c afterwards; in the background modeling loss of formulas (9) and (10), m denotes a predefined maximum feature magnitude, C denotes the number of action classes in the data set, and S_{c,j} denotes the softmax classification score of action c in segment j;
Step 10: setting a multi-label action classification loss as the cross-entropy loss between the prediction score p_i and the video-level action label y_i, given by formula (11); the total loss function of the proxy metric model is given by formula (12), where N denotes the number of training videos in a training batch and α and β are weighting parameters;
Step 11: applying the action proxy map P trained in Step 6 to the test stage; in the test stage, the possible actions contained in the target video are represented by the classification score
S_{t,c} = π_{t,c} · S(X_t, P_c)   (13)
where π_{t,c} denotes the segment classification score of action class c in the t-th segment and S(X_t, P_c) denotes the cosine similarity between the embedded feature vector of the t-th segment and the proxy vector P_c of action class c;
Step 12: applying the proxy metric model of Step 10 to an action localization data set, training a neural network model using the video-level action classification labels of the data set, and then testing the localization accuracy with the trained neural network model.
2. The weakly supervised video behavior localization method based on the proxy metric model as recited in claim 1, wherein: the methods for extracting the training-set video features in Step 1 include the C3D, I3D and TSN models.
3. The weakly supervised video behavior localization method based on the proxy metric model as recited in claim 1, wherein: the number K_act of pseudo-action segments in Step 5 is set to the average number of action segments appearing in the videos of the actual data set.
4. The weakly supervised video behavior localization method based on the proxy metric model as recited in claim 1, wherein: the number K_bkg of pseudo-background segments in Step 5 is set to twice the number K_act of pseudo-action segments.
5. The weakly supervised video behavior localization method based on the proxy metric model as recited in claim 1, wherein: a training segment may participate in the action proxy feature vector of Step 6 only if: 1. the video to which the segment belongs contains a label of the corresponding action; and 2. the segment belongs to the pseudo-action segments of that video.
6. The weakly supervised video behavior localization method based on the proxy metric model as recited in claim 1, wherein: the higher the parameter ε in Step 6 is set, the higher the precision of the action proxy vector P_c; when the value of ε exceeds a certain threshold, the training of P_c converges.
7. The weakly supervised video behavior localization method based on the proxy metric model as recited in claim 1, wherein: in Step 10, the values of the parameters α and β represent the degree to which the proxy-metric loss functions of Step 7 and Step 8 participate in model training; they are set low before the training of the action proxy map P is complete and high afterwards.
8. The weakly supervised video behavior localization method based on the proxy metric model as recited in claim 1, wherein: in Step 11, the segment classification score π may be multiplied one or more times by the cosine similarity S(X, P) between the embedded feature vector and the proxy vector of the action class.
CN202110527929.XA 2021-05-14 2021-05-14 Weakly supervised video behavior localization method based on a proxy metric model Active CN113420592B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110527929.XA CN113420592B (en) 2021-05-14 2021-05-14 Weakly supervised video behavior localization method based on a proxy metric model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110527929.XA CN113420592B (en) 2021-05-14 2021-05-14 Weakly supervised video behavior localization method based on a proxy metric model

Publications (2)

Publication Number Publication Date
CN113420592A CN113420592A (en) 2021-09-21
CN113420592B true CN113420592B (en) 2022-11-18

Family

ID=77712323

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110527929.XA Active CN113420592B (en) 2021-05-14 2021-05-14 Weakly supervised video behavior localization method based on a proxy metric model

Country Status (1)

Country Link
CN (1) CN113420592B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111797771A (en) * 2020-07-07 2020-10-20 南京理工大学 Method and system for detecting weak supervision video behaviors based on iterative learning
CN111914644A (en) * 2020-06-30 2020-11-10 西安交通大学 Dual-mode cooperation based weak supervision time sequence action positioning method and system
CN111914778A (en) * 2020-08-07 2020-11-10 重庆大学 Video behavior positioning method based on weak supervised learning

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111914644A (en) * 2020-06-30 2020-11-10 西安交通大学 Dual-mode cooperation based weak supervision time sequence action positioning method and system
CN111797771A (en) * 2020-07-07 2020-10-20 南京理工大学 Method and system for detecting weak supervision video behaviors based on iterative learning
CN111914778A (en) * 2020-08-07 2020-11-10 重庆大学 Video behavior positioning method based on weak supervised learning

Also Published As

Publication number Publication date
CN113420592A (en) 2021-09-21

Similar Documents

Publication Publication Date Title
Wang et al. Relaxed multiple-instance SVM with application to object discovery
CN110569793B (en) Target tracking method for unsupervised similarity discrimination learning
Jing et al. Videossl: Semi-supervised learning for video classification
Lai et al. Video event detection by inferring temporal instance labels
CN109241829B (en) Behavior identification method and device based on space-time attention convolutional neural network
CN107145862B (en) Multi-feature matching multi-target tracking method based on Hough forest
CN108537119B (en) Small sample video identification method
CN111553127A (en) Multi-label text data feature selection method and device
JP2006172437A (en) Method for determining position of segment boundary in data stream, method for determining segment boundary by comparing data subset with vicinal data subset, program of instruction executable by computer, and system or device for identifying boundary and non-boundary in data stream
CN114930352A (en) Method for training image classification model
CN104966105A (en) Robust machine error retrieving method and system
Ge et al. Fine-grained bird species recognition via hierarchical subset learning
CN112381248A (en) Power distribution network fault diagnosis method based on deep feature clustering and LSTM
CN111598004A (en) Progressive-enhancement self-learning unsupervised cross-domain pedestrian re-identification method
CN112507778B (en) Loop detection method of improved bag-of-words model based on line characteristics
CN109086794B (en) Driving behavior pattern recognition method based on T-LDA topic model
CN107220597B (en) Key frame selection method based on local features and bag-of-words model human body action recognition process
Vainstein et al. Modeling video activity with dynamic phrases and its application to action recognition in tennis videos
Nemade et al. Image segmentation using convolutional neural network for image annotation
CN106056146B (en) The visual tracking method that logic-based returns
Ma et al. Event detection in soccer video based on self-attention
CN113420592B (en) Agent measurement model-based weak surveillance video behavior positioning method
CN113128410A (en) Weak supervision pedestrian re-identification method based on track association learning
CN106326927B (en) A kind of shoes print new category detection method
Artan et al. Combining multiple 2ν-SVM classifiers for tissue segmentation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant