CN113420592B - Weakly supervised video behavior localization method based on a proxy metric model - Google Patents

Weakly supervised video behavior localization method based on a proxy metric model

Info

Publication number
CN113420592B
Authority
CN
China
Prior art keywords
action
video
segment
vector
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110527929.XA
Other languages
Chinese (zh)
Other versions
CN113420592A (en)
Inventor
张宇
米思娅
陈子涵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN202110527929.XA priority Critical patent/CN113420592B/en
Publication of CN113420592A publication Critical patent/CN113420592A/en
Application granted granted Critical
Publication of CN113420592B publication Critical patent/CN113420592B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a weakly supervised video behavior localization method based on a proxy metric model. Behavior localization plays an important role in the field of behavior recognition, and because manual annotation of temporal action intervals is expensive and time-consuming, an effective weakly supervised video behavior localization method is indispensable. The invention provides a proxy metric module that clusters the features of segments of the same action together and pushes the features of background segments in untrimmed videos away from action-segment features, which effectively improves the accuracy of video behavior localization in a weakly supervised setting.

Description

Weakly supervised video behavior localization method based on a proxy metric model
Technical Field
The application relates to the technical field of computer vision, and in particular to a weakly supervised video behavior localization method based on a proxy metric model.
Background
Video action localization refers to training an artificial intelligence model to detect the time intervals in which actions occur in a video and to identify the action categories. It is widely used in many fields such as intelligent surveillance, action retrieval, human-computer interaction and virtual reality. Traditional video action localization is fully supervised: the labels of the training set include not only the classification label of each action in the video but also the start and end times of each action. However, as the number of videos generated every day in the real world grows, manual frame-level annotation of video action intervals is both expensive and time-consuming, and the accuracy of manually annotated times is difficult to guarantee. Therefore, weakly supervised video action localization, which requires no temporal action labels, plays an essential role in the field of video behavior recognition.
Video data in the field of video action localization contain a large number of action-irrelevant segments, commonly referred to as background segments. Weakly supervised video action localization only requires the data set to carry video-level action classification labels; annotation of the start and end times of actions is not required. In this setting, the key to successfully training a video action localization model is that segments of the same action share similar features. If a segment has low similarity to the rest of the training set, it can with high probability be regarded as a background segment. By means of this segment similarity, the occurrence time of actions can be predicted during model training. The study of weakly supervised video action localization is therefore feasible.
Existing weakly supervised video action localization methods mainly work with attention mechanisms, multiple-instance learning and similarity metrics. However, these methods always incorporate all segments of a video into the model training process; video segments with a low probability of containing actions are not excluded from training in advance. In addition, existing methods do not combine the common feature dimensions of the same action class into a representative action feature vector, so the accuracy of action clustering and background separation during model training is difficult to improve. A weakly supervised video action localization method based on a proxy metric model is therefore urgently needed.
Disclosure of Invention
Purpose of the invention: In order to solve the problems in the prior art and to localize and recognize actions by training a neural network model when the video data set carries no temporal action labels, the invention provides a weakly supervised video behavior localization method based on a proxy metric model, which relies on action proxy vectors to perform action clustering and background separation effectively, thereby improving the accuracy of weakly supervised action localization.
Technical scheme: A weakly supervised video behavior localization method based on a proxy metric model comprises the following steps:
Step 1: Separate and extract the feature vectors of the training-set videos. The untrimmed video V is regarded as a collection of segments, each containing an equal number of frames, so that the k-th video V_k is represented as a sequence of n segments, where k denotes the index of the video.
Step 2: After feature extraction, compute their uniform embedded form: for a given video V_k, the feature vector of each segment is fed into a module consisting of a fully connected layer, a ReLU activation layer and a Dropout layer, producing an embedded feature vector, which can be expressed as:
X_emb = f_embed(X, Θ_embed)   (1)
where f_embed denotes the embedding network and Θ_embed denotes its parameters.
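By way of illustration, a minimal PyTorch sketch of such an embedding module is given below; the feature dimension and dropout rate are assumed values, not ones fixed by the disclosure.

```python
import torch
import torch.nn as nn

class EmbeddingModule(nn.Module):
    """Embedding network f_embed of formula (1): FC -> ReLU -> Dropout.

    Input:  X of shape (n_segments, feat_dim), the per-segment features from Step 1.
    Output: X_emb of shape (n_segments, emb_dim).
    Dimensions and dropout rate below are illustrative assumptions.
    """
    def __init__(self, feat_dim: int = 2048, emb_dim: int = 2048, p_drop: float = 0.5):
        super().__init__()
        self.fc = nn.Linear(feat_dim, emb_dim)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(p_drop)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.dropout(self.relu(self.fc(x)))
```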
Step 3: Apply a Class Activation Sequence (CAS) module to calculate the segment-level action classification score of each embedded feature vector in the video, expressed as:
π = g_cas(X_emb, Θ_cas)   (2)
where g_cas denotes the CAS linear classifier with parameters Θ_cas.
Step 4: Next, calculate the action classification score of the whole video (the CAS classification score), using a softmax function to compute the likelihood that the video contains each action class, expressed as:
softmax(p)_c = exp(p_c) / Σ_{c'=1..C} exp(p_{c'})   (3)
where C denotes the number of action classes and p_c denotes the mean of the K_act highest segment classification scores of action class c. K_act is a hyper-parameter representing the number of pseudo-action segments.
Step 5: So that only the most representative parts of the video participate in training the neural network model, pseudo-action segments and pseudo-background segments are screened from the video. For the n-th segment of the t-th video V_t, calculate the likelihood that it contains an action to be localized, expressed as the action likelihood score:
p(V_{t,n}) = dropout(||X_{t,n}||)   (4)
where X_{t,n} denotes the embedded feature vector of the n-th segment of the t-th video computed in Step 2 and ||·|| denotes the L2 norm. The K_act segments with the highest action likelihood scores in the video are taken as pseudo-action segments, denoted A_act; conversely, the K_bkg segments with the lowest action likelihood scores are taken as pseudo-background segments, denoted A_bkg.
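A short sketch of this selection step, assuming the per-video embedded features from Step 2 are available as a tensor; the use of topk over the dropout-perturbed L2 norms follows formula (4), while the concrete values of K_act and K_bkg are left as arguments.

```python
import torch
import torch.nn.functional as F

def select_pseudo_segments(x_emb: torch.Tensor, k_act: int, k_bkg: int,
                           p_drop: float = 0.5, training: bool = True):
    """Formula (4): action likelihood score = dropout(||X_{t,n}||_2).

    x_emb: (n_segments, emb_dim) embedded features of one video (Step 2).
    Returns indices of the pseudo-action segments A_act (k_act highest scores)
    and the pseudo-background segments A_bkg (k_bkg lowest scores).
    """
    scores = x_emb.norm(p=2, dim=1)                              # L2 norm per segment
    scores = F.dropout(scores, p=p_drop, training=training)      # dropout on the scores
    a_act = torch.topk(scores, k_act, largest=True).indices      # pseudo-action segments
    a_bkg = torch.topk(scores, k_bkg, largest=False).indices     # pseudo-background segments
    return a_act, a_bkg
```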
Step 6: Next, train a proxy vector P_c for each action class; the proxy vectors P_1, P_2, …, P_C form the action proxy map P. Among the pseudo-action segments of a video, the K_topa segments with the highest CAS classification scores of the corresponding action are selected for training. If a video V contains action c in its set of action labels, its action center vector for action c is calculated by formula (5) as the average of the embedded feature vectors of the selected segments, weighted by their CAS classification scores for action c, where π^c_t denotes the CAS classification score of action c at the t-th segment and X_{t,i} denotes the embedded feature vector of segment i. Because the same action has different manifestations in different videos, the action proxy vector is a synthesis of the action features over the whole data set; it is calculated by formula (6) as the average of the action center vectors over the n_c videos labeled with action c. The proxy vector can be representative of the original action only when enough videos participate in its training. Therefore, a parameter ε is set, and P is considered valid when the number of videos used to train the proxy vector P exceeds ε.
Step 7: Perform action clustering and background classification on the segments of the video according to the action proxy vectors calculated in Step 6. For a segment X in a training video, if X belongs to a pseudo-action segment, the action clustering loss L_act is applied for clustering; if X belongs to a pseudo-background segment, skip to Step 8. If an action class k is in the label set of X and the classification score of X for k is the maximum over all action classes, the proxy vector of that class belongs to the positive proxy set P^+_X of X, while the proxies of the remaining actions belong to the negative proxy set P^-_X. The action clustering loss L_act (formula (7)) is computed from c_{x,k}, the classification score of action k on segment X, and S(·,·), the cosine similarity. Through this loss, the features of X are drawn toward the positive proxy set P^+_X and pushed away from the negative proxy set P^-_X, thereby clustering features of the same action.
Step 8: If a segment X in a training video belongs to a pseudo-background segment, then, to distinguish background from action effectively, the background classification loss L_bkg (formula (8)) is applied, which keeps the features of X away from all action proxy vectors P; that is, all action proxy vectors form the negative proxy set P^-_X. The terms c_{x,k} and S(·,·) have the same meaning as in Step 7. This loss keeps the features of X far from the representative features of all actions to be detected, reducing the chance that background segments in the video are misjudged as action segments.
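Because the exact expressions of formulas (7) and (8) are not reproduced in this text, the following is only a plausible contrastive sketch of the stated intent: pull pseudo-action features toward their positive proxy set and push pseudo-background features away from all proxies, using the cosine similarity S(·,·) and the classification scores c_{x,k} as weights. The temperature tau and the precise weighting are assumptions, not the patent's exact losses.

```python
import torch
import torch.nn.functional as F

def action_clustering_loss(x, proxies, pos_mask, cls_scores, tau: float = 0.1):
    """Hypothetical stand-in for L_act (formula (7)): a softmax-style term over cosine
    similarities that pulls the pseudo-action feature x toward proxies in its positive
    set and away from the rest; the classification scores weight the positive terms.

    x:          (emb_dim,) embedded feature of one pseudo-action segment
    proxies:    (C, emb_dim) action proxy vectors P_1..P_C
    pos_mask:   (C,) bool, True for proxies in the positive set of x
    cls_scores: (C,) segment classification scores c_{x,k}
    """
    sim = F.cosine_similarity(x.unsqueeze(0), proxies, dim=1) / tau   # S(x, P_k) / tau
    log_p = F.log_softmax(sim, dim=0)
    w = cls_scores * pos_mask.float()
    return -(w * log_p).sum() / w.sum().clamp(min=1e-6)

def background_classification_loss(x, proxies, tau: float = 0.1):
    """Hypothetical stand-in for L_bkg (formula (8)): penalize high similarity between
    the pseudo-background feature x and every action proxy (all proxies are negative)."""
    sim = F.cosine_similarity(x.unsqueeze(0), proxies, dim=1) / tau
    return torch.logsumexp(sim, dim=0) - torch.log(torch.tensor(float(proxies.shape[0])))
```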
Step 9: Because the training of the action proxy vector P_c relies on the segment-level action classification scores of Step 3, and the accuracy of the classification module is low early in model training, a background modeling loss is applied first so that the classification module converges quickly; the training of the action proxy vectors P_c follows. In the background modeling loss (formulas (9) and (10)), m denotes a predefined maximum feature magnitude, C denotes the number of action classes in the data set, and S_{c,j} denotes the softmax classification score of action c in segment j.
Step 10: Set a multi-label action classification loss as the cross-entropy loss between the prediction score p_i and the video-level action label y_i (formula (11)). The total loss function of the proxy metric model is given by formula (12), where N denotes the number of training videos in a training batch.
Step 11: Apply the action proxy map P trained in Step 6 to the test stage. In the test stage, the possible actions contained in the target video are represented by the classification score
S_{t,c} = π_{t,c} · S(X_t, P_c)   (13)
where π_{t,c} denotes the segment classification score of action class c in the t-th segment and S(X_t, P_c) denotes the cosine similarity between the embedded feature vector of the t-th segment and the proxy vector P_c of action class c.
Step 12: Apply the proxy metric model of Step 10 to an action localization data set, train a neural network model using the video-level action classification labels of the data set, and then test the localization accuracy with the trained neural network model.
Further, the methods for extracting video features in Step 1 include the C3D, I3D and TSN models.
Further, the number K_act of pseudo-action segments in Step 5 is set according to the average number of action segments appearing in the videos of the actual data set.
Further, the number K_bkg of pseudo-background segments in Step 5 is set to twice the number K_act of pseudo-action segments.
Further, a training segment may participate in the action proxy feature vector of Step 6 only if: 1. the video to which the segment belongs contains a label of the corresponding action; and 2. the segment belongs to the pseudo-action segments of that video.
Further, the higher the parameter ε in Step 6 is set, the higher the precision of the action proxy vector P. When the value of ε exceeds a certain threshold, the training of P converges. The setting of ε is adjusted according to the actual training process.
Further, in Step 10 the values of the parameters α and β represent the degree to which the proxy-metric losses of Step 7 and Step 8 participate in model training; they are set low before the action proxy map P has been trained and high afterwards.
Further, in Step 11 the segment classification score π may be multiplied one or more times by the cosine similarity S(X, P) between the embedded feature vector and the proxy vector of the action class.
Beneficial effects: The invention provides a proxy metric model method for weakly supervised video action localization that achieves higher action localization accuracy than existing methods.
Drawings
FIG. 1 is a flow chart of a method of the present invention;
FIG. 2 is a TensorBoardX visualization of the feature distribution of the proxy vectors of the 20 action classes of the THUMOS14 data set trained by the present invention;
FIG. 3 is a diagram comparing the action localization result of the present invention on an untrimmed golf video with the ground truth of the video;
FIG. 4 is a diagram comparing the action localization result of the present invention on an untrimmed beach volleyball video with the ground truth of the video.
Detailed Description
The invention is described in further detail below with reference to the detailed description and the accompanying drawings:
the embodiment provides a weakly supervised video behavior localization method based on a proxy metric model, and applies the method to the action localization effect detection of the THUMOS14 uncut data set.
The flow of the method is shown in figure 1:
Step 1: Separate and extract the feature vectors of the training-set videos. An untrimmed video V can be regarded as a collection of segments, each containing an equal number of frames, so that the k-th video V_k is divided into a sequence of n segments, where k denotes the index of the video. The methods used to extract video features include the C3D, I3D and TSN models.
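A minimal sketch of this segmentation, assuming frame-level features have already been extracted by a backbone such as I3D; the segment length of 16 frames and the mean pooling inside a segment are illustrative assumptions.

```python
import torch

def split_into_segments(frame_feats: torch.Tensor, frames_per_seg: int = 16) -> torch.Tensor:
    """Group frame-level features into non-overlapping segments with an equal number of
    frames and average the frame features inside each segment (Step 1).

    frame_feats: (n_frames, feat_dim) features from a backbone such as I3D / C3D / TSN.
    Returns:     (n_segments, feat_dim); trailing frames that do not fill a whole
                 segment are dropped.
    """
    n_frames, feat_dim = frame_feats.shape
    n_segments = n_frames // frames_per_seg
    trimmed = frame_feats[: n_segments * frames_per_seg]
    return trimmed.reshape(n_segments, frames_per_seg, feat_dim).mean(dim=1)
```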
Step 2: After feature extraction, compute their uniform embedded form. For a given video V_k, the feature vector of each segment is fed into a module consisting of a fully connected layer, a ReLU activation layer and a Dropout layer. The embedded feature vector can be expressed as:
X_emb = f_embed(X, Θ_embed)   (1)
where f_embed denotes the embedding network and Θ_embed denotes its parameters.
Step 3: Apply a Class Activation Sequence (CAS) module to calculate the segment-level action classification score of each embedded feature vector in the video:
π = g_cas(X_emb, Θ_cas)   (2)
where g_cas denotes the CAS linear classifier with parameters Θ_cas.
Step 4: Next, calculate the action classification score of the entire video, using a softmax function to compute the likelihood that the video contains each action class:
softmax(p)_c = exp(p_c) / Σ_{c'=1..C} exp(p_{c'})   (3)
where C denotes the number of action classes and p_c denotes the mean of the K_act highest segment classification scores of action class c. K_act is a hyper-parameter representing the number of pseudo-action segments; its setting is discussed in Step 5.
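A short sketch of this video-level scoring, following formula (3) as reconstructed above (top-K_act mean per class followed by a softmax over classes):

```python
import torch

def video_level_scores(cas: torch.Tensor, k_act: int) -> torch.Tensor:
    """Video-level action classification of Step 4: p_c is the mean of the k_act highest
    segment scores of class c, then a softmax is taken over the classes.

    cas: (n_segments, C) segment-level CAS scores from Step 3.
    Returns: (C,) likelihood that the video contains each action class.
    """
    topk = torch.topk(cas, k_act, dim=0).values   # (k_act, C) highest scores per class
    p = topk.mean(dim=0)                          # (C,) per-class means
    return torch.softmax(p, dim=0)
```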
Step 5: So that only the most representative parts of the video participate in training the neural network model, pseudo-action segments and pseudo-background segments are screened from the video. For the n-th segment of the t-th video V_t, calculate the likelihood that it contains an action to be localized, expressed as the action likelihood score:
p(V_{t,n}) = dropout(||X_{t,n}||)   (4)
where X_{t,n} denotes the embedded feature vector of the n-th segment of the t-th video computed in Step 2 and ||·|| denotes the L2 norm. The K_act segments with the highest action likelihood scores in the video are taken as pseudo-action segments, denoted A_act; the K_bkg segments with the lowest action likelihood scores are taken as pseudo-background segments, denoted A_bkg.
The number K_act of pseudo-action segments is set according to the average number of action segments appearing in the videos of the actual data set, and the number K_bkg of pseudo-background segments is set to twice K_act.
Step 6: Train a proxy vector P_c for each action class; the proxy vectors P_1, P_2, …, P_C form the action proxy map P. Among the pseudo-action segments of a video, the K_topa segments with the highest CAS classification scores of the corresponding action are selected for training. If a video V contains action c in its set of action labels, its action center vector for action c is calculated by formula (5) as the average of the embedded feature vectors of the selected segments, weighted by their CAS classification scores for action c, where π^c_t denotes the CAS classification score of action c at the t-th segment and X_{t,i} denotes the embedded feature vector of segment i. Since the same action has different manifestations in different videos, the action proxy vector is a synthesis of the action features over the whole data set; it is calculated by formula (6) as the average of the action center vectors over the n_c videos labeled with action c. The proxy vector can be representative of the original action only when enough videos participate in its training. Therefore, a parameter ε is set, and P is considered valid when the number of videos used to train it exceeds ε; the higher ε is set, the higher the precision of the action proxy vector P. When ε exceeds a certain threshold, the training of P converges, and the setting of ε is adjusted according to the actual training process.
A training segment may participate in the action proxy feature vector only if: 1. the video to which the segment belongs contains a label of the corresponding action; and 2. the segment belongs to the pseudo-action segments of that video.
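One plausible realization of formulas (5) and (6) as described above (a CAS-score-weighted center per video, then an average over the videos carrying the label, with the validity check on ε); the exact formulas are not reproduced in this text, so details such as the clamping are assumptions.

```python
from typing import List, Optional
import torch

def action_center(x_emb: torch.Tensor, cas_c: torch.Tensor, idx_topa: torch.Tensor) -> torch.Tensor:
    """CAS-score-weighted average of the embedded features of the selected top-K_topa
    pseudo-action segments of one video (the action center vector of formula (5), as
    described in words above).

    x_emb:    (n_segments, emb_dim) embedded features of the video
    cas_c:    (n_segments,) CAS classification scores of action c
    idx_topa: indices of the K_topa selected segments
    """
    w = cas_c[idx_topa].clamp(min=0)
    return (w.unsqueeze(1) * x_emb[idx_topa]).sum(dim=0) / w.sum().clamp(min=1e-6)

def action_proxy(centers: List[torch.Tensor], eps: int) -> Optional[torch.Tensor]:
    """Average of the action center vectors over the n_c videos labeled with action c
    (formula (6) as described above); the proxy is treated as valid only when more than
    eps videos have contributed (Step 6)."""
    if len(centers) <= eps:
        return None  # not enough contributing videos yet; proxy not yet valid
    return torch.stack(centers).mean(dim=0)
```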
FIG. 2 shows the TensorBoardX visualization of the feature distribution of the proxy vectors of the 20 action classes trained on the THUMOS14 data set by the method of Step 6. The distribution of the trained action proxy vectors is uniform and representative, so it is feasible to compute the distance between an action segment and the original feature vectors of different actions using the trained action proxy vectors.
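A minimal sketch of how such a projection could be logged with tensorboardX for inspection in the TensorBoard projector; the log directory and tag are placeholders.

```python
import torch
from tensorboardX import SummaryWriter

def log_proxy_embedding(proxies: torch.Tensor, class_names: list, logdir: str = "runs/proxies") -> None:
    """Log the C trained action proxy vectors so that the TensorBoard projector can
    display their feature distribution, as in FIG. 2.

    proxies:     (C, emb_dim) action proxy map P
    class_names: list of C action class names (e.g. the THUMOS14 labels)
    """
    writer = SummaryWriter(logdir)
    writer.add_embedding(proxies, metadata=class_names, tag="action_proxies")
    writer.close()
```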
Step 7: Next, perform action clustering and background classification on the segments of the video with the help of the action proxy vectors calculated in Step 6. For a segment X in a training video, if X belongs to a pseudo-action segment, the action clustering loss L_act is applied for clustering; if X belongs to a pseudo-background segment, skip to Step 8. If an action class k is in the label set of X and the classification score of X for k is the maximum over all action classes, the proxy vector of that class belongs to the positive proxy set P^+_X of X, while the proxies of the remaining actions belong to the negative proxy set P^-_X. The action clustering loss L_act (formula (7)) is computed from c_{x,k}, the classification score of action k on segment X, and S(·,·), the cosine similarity. Through this loss, the features of X are drawn toward the positive proxy set P^+_X and pushed away from the negative proxy set P^-_X, thereby clustering features of the same action.
Step 8: If a segment X in a training video belongs to a pseudo-background segment, then, to distinguish the background from the actions more effectively, the background classification loss L_bkg (formula (8)) is applied, which keeps the features of X away from all action proxy vectors; that is, all action proxy vectors form the negative proxy set P^-_X. The terms c_{x,k} and S(·,·) have the same meaning as in Step 7. This loss keeps the features of X far from the representative features of all actions to be detected, reducing the chance that background segments are misjudged as action segments.
Step 9: Since the training of the proxy vectors relies on the segment-level action classification scores of Step 3, and the classification module has low accuracy early in model training, a background modeling loss is applied so that the classification module converges quickly before the training of the proxy vectors begins. In the background modeling loss (formulas (9) and (10)), m denotes a predefined maximum feature magnitude, C denotes the number of action classes in the data set, and S_{c,j} denotes the softmax classification score of action c in segment j.
Step 10: Set a multi-label action classification loss as the cross-entropy loss between the prediction score p_i and the video-level action label y_i (formula (11)). The total loss function of the proxy metric model is given by formula (12), where N denotes the number of training videos in a training batch.
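Since formulas (11) and (12) are not reproduced in this text, the sketch below shows only one common form of the multi-label cross-entropy and a hypothetical combination of the loss terms with the weights α and β described in the next paragraph; it is not the disclosure's exact total loss.

```python
import torch

def classification_loss(pred: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """One common form of the multi-label video classification cross-entropy of Step 10:
    the multi-hot labels y_i are normalized to a distribution and compared with the
    softmax video-level predictions p_i of Step 4 (an assumption for formula (11)).

    pred:   (N, C) video-level class probabilities
    labels: (N, C) multi-hot video-level action labels
    """
    y = labels.float() / labels.float().sum(dim=1, keepdim=True).clamp(min=1.0)
    return -(y * torch.log(pred + 1e-8)).sum(dim=1).mean()

def total_loss(l_cls: torch.Tensor, l_act: torch.Tensor, l_bkg: torch.Tensor,
               l_bm: torch.Tensor, alpha: float, beta: float) -> torch.Tensor:
    """Hypothetical combination of the loss terms (formula (12) is not reproduced):
    classification loss plus the proxy-metric losses of Steps 7 and 8, weighted by
    alpha and beta, plus the background modeling loss of Step 9."""
    return l_cls + alpha * l_act + beta * l_bkg + l_bm
```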
The values of the parameters α and β control the degree to which the proxy-metric losses of Step 7 and Step 8 participate in model training; they are set low before the training of the action proxy map P is complete and high afterwards.
Step 11: The trained action proxy map P is then applied in the test stage. In the test stage, the possible actions contained in the target video are represented by the classification score
S_{t,c} = π_{t,c} · S(X_t, P_c)   (13)
where π_{t,c} denotes the segment classification score of action class c in the t-th segment and S(X_t, P_c) denotes the cosine similarity between the embedded feature vector of the t-th segment and the proxy vector P_c of action class c. The segment classification score π may be multiplied one or more times by the cosine similarity S(X, P) between the embedded feature vector and the proxy vector of the action class.
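A short sketch of this test-stage fusion, implementing formula (13); the optional repeated multiplication by the cosine similarity is exposed as a power argument, which is an assumption about how that repetition is applied.

```python
import torch
import torch.nn.functional as F

def test_scores(x_emb: torch.Tensor, cas: torch.Tensor, proxies: torch.Tensor,
                power: int = 1) -> torch.Tensor:
    """Test-stage classification score of formula (13):
    S_{t,c} = pi_{t,c} * S(X_t, P_c), with the cosine similarity optionally multiplied
    in more than once via `power`.

    x_emb:   (T, emb_dim) embedded segment features of the test video
    cas:     (T, C) segment classification scores pi_{t,c}
    proxies: (C, emb_dim) trained action proxy vectors
    Returns: (T, C) fused segment scores used for localization.
    """
    sim = F.cosine_similarity(x_emb.unsqueeze(1), proxies.unsqueeze(0), dim=2)  # (T, C)
    return cas * sim.pow(power)
```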
Step 12: The proxy metric model is applied to the THUMOS14 data set; the neural network model is trained using only the video-level action classification labels of the data set, and the trained model is then used to test localization accuracy.
In this example, weakly supervised action localization was tested on the THUMOS14 data set. For IoU thresholds in the range starting at 0.3, the method obtains very competitive results compared with previous weakly supervised action localization methods. FIG. 3 and FIG. 4 show the localization effect of the proposed method on individual untrimmed videos. For the beach volleyball video of FIG. 3, the method yields a relatively accurate localization result for each action interval. Some misjudgments of the golf action in FIG. 4 occur because the person in the video performs the golf motion without actually swinging, an error that is difficult to avoid in current video action localization methods. Overall, the difference between the localization results and the ground truth is very small, which strongly demonstrates the advantage of the proposed proxy metric model in practical applications.

Claims (8)

1. A weakly supervised video behavior localization method based on a proxy metric model, characterized in that the method comprises the following steps:
Step 1: separating and extracting feature vectors of the training-set videos, regarding an untrimmed video V as a collection of segments, wherein each segment contains an equal number of frames, so that the k-th video V_k is represented as a sequence of n segments, where k denotes the index of the video;
Step 2: after feature extraction, calculating a uniform embedded form: for a given video V_k, feeding the feature vector of each segment into a module consisting of a fully connected layer, a ReLU activation layer and a Dropout layer to obtain an embedded feature vector, expressed as:
X_emb = f_embed(X, Θ_embed)   (1)
where f_embed denotes the embedding network and Θ_embed denotes its parameters;
Step 3: applying a Class Activation Sequence (CAS) module to calculate the segment-level action classification score of each embedded feature vector in the video, expressed as:
π = g_cas(X_emb, Θ_cas)   (2)
where g_cas denotes the CAS linear classifier with parameters Θ_cas;
Step 4: calculating the action classification score of the whole video (the CAS classification score), using a softmax function to compute the likelihood that the video contains each action class, expressed as:
softmax(p)_c = exp(p_c) / Σ_{c'=1..C} exp(p_{c'})   (3)
where C denotes the number of action classes and p_c denotes the mean of the K_act highest segment classification scores of action class c; K_act is a hyper-parameter denoting the number of pseudo-action segments;
Step 5: screening pseudo-action segments and pseudo-background segments from the video, and calculating, for the n-th segment of the t-th video V_t, the likelihood that it contains an action to be localized, expressed as the action likelihood score:
p(V_{t,n}) = dropout(||X_{t,n}||)   (4)
where X_{t,n} denotes the embedded feature vector of the n-th segment of the t-th video calculated in Step 2 and ||·|| denotes the L2 norm; taking the K_act segments with the highest action likelihood scores in the video as pseudo-action segments, denoted A_act, and the K_bkg segments with the lowest action likelihood scores as pseudo-background segments, denoted A_bkg;
Step 6: training an action proxy vector P_c for each action class, the proxy vectors {P_1, P_2, …, P_C} forming an action proxy map P; selecting, from the pseudo-action segments of the video, the K_topa segments with the highest CAS classification scores of the corresponding action for training; if a video V contains action c in its set of action labels, its action center vector for action c is calculated by formula (5) as the average of the embedded feature vectors of the selected segments weighted by their CAS classification scores for action c, where π^c_t denotes the CAS classification score of action c at the t-th segment and X_{t,i} denotes the embedded feature vector of segment i; because the same action has different manifestations in different videos, the action proxy vector is a synthesis of the action features over the whole data set and is calculated by formula (6) as the average of the action center vectors over the n_c videos labeled with action c; setting a parameter ε, and considering the action proxy vector P_c valid when the number of videos used to train it exceeds ε;
Step 7: performing action clustering and background classification on the segments of the video according to the action proxy vectors calculated in Step 6: for a segment X in a training video, if X belongs to a pseudo-action segment, applying the action clustering loss L_act for clustering, and if X belongs to a pseudo-background segment, skipping to Step 8; if an action class k is in the label set of X and the classification score of X for k is the maximum over all action classes, the proxy vector of that class belongs to the positive proxy set P^+_X of X, while the proxies of the remaining actions belong to the negative proxy set P^-_X; the action clustering loss L_act is calculated by formula (7) from c_{x,k}, the classification score of action k on segment X, and S(·,·), the cosine similarity;
Step 8: if a segment X in a training video belongs to a pseudo-background segment, setting, in order to distinguish the background from the actions effectively, the background classification loss L_bkg, which keeps the features of X away from all action proxy vectors, i.e. all action proxy vectors form the negative proxy set P^-_X; the background classification loss L_bkg is given by formula (8);
Step 9: because the training of the action proxy vector P_c relies on the segment-level action classification scores of Step 3, while the accuracy of the classification module is low early in model training, applying a background modeling loss so that the classification module converges quickly, and training the action proxy vectors P_c afterwards; in the background modeling loss of formulas (9) and (10), m denotes a predefined maximum feature magnitude, C denotes the number of action classes in the data set, and S_{c,j} denotes the softmax classification score of action c in segment j;
Step 10: setting a multi-label action classification loss as the cross-entropy loss between the prediction score p_i and the video-level action label y_i, given by formula (11); the total loss function of the proxy metric model is given by formula (12), where N denotes the number of training videos in a training batch and α and β are weighting parameters;
Step 11: applying the action proxy map P trained in Step 6 to the test stage; in the test stage, the possible actions contained in the target video are represented by the classification score
S_{t,c} = π_{t,c} · S(X_t, P_c)   (13)
where π_{t,c} denotes the segment classification score of action class c in the t-th segment and S(X_t, P_c) denotes the cosine similarity between the embedded feature vector of the t-th segment and the proxy vector P_c of action class c;
Step 12: applying the proxy metric model of Step 10 to an action localization data set, training a neural network model using the video-level action classification labels of the data set, and then testing the localization accuracy with the trained neural network model.
2. The weakly supervised video behavior localization method based on the proxy metric model as recited in claim 1, wherein: the methods for extracting the training-set video features in Step 1 include the C3D, I3D and TSN models.
3. The weakly supervised video behavior localization method based on the proxy metric model as recited in claim 1, wherein: the number K_act of pseudo-action segments in Step 5 is set to the average number of action segments appearing in the videos of the actual data set.
4. The weakly supervised video behavior localization method based on the proxy metric model as recited in claim 1, wherein: the number K_bkg of pseudo-background segments in Step 5 is set to twice the number K_act of pseudo-action segments.
5. The weakly supervised video behavior localization method based on the proxy metric model as recited in claim 1, wherein: a training segment may participate in the action proxy feature vector of Step 6 only if: 1. the video to which the segment belongs contains a label of the corresponding action; and 2. the segment belongs to the pseudo-action segments of that video.
6. The weakly supervised video behavior localization method based on the proxy metric model as recited in claim 1, wherein: the higher the parameter ε in Step 6 is set, the higher the precision of the action proxy vector P_c; when the value of ε exceeds a certain threshold, the training of P_c converges.
7. The weakly supervised video behavior localization method based on the proxy metric model as recited in claim 1, wherein: in Step 10, the values of the parameters α and β represent the degree to which the proxy-metric loss functions of Step 7 and Step 8 participate in model training; they are set low before the training of the action proxy map P is complete and high afterwards.
8. The weakly supervised video behavior localization method based on the proxy metric model as recited in claim 1, wherein: in Step 11, the segment classification score π may be multiplied one or more times by the cosine similarity S(X, P) between the embedded feature vector and the proxy vector of the action class.
CN202110527929.XA 2021-05-14 2021-05-14 Weakly supervised video behavior localization method based on a proxy metric model Active CN113420592B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110527929.XA CN113420592B (en) 2021-05-14 2021-05-14 Weakly supervised video behavior localization method based on a proxy metric model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110527929.XA CN113420592B (en) 2021-05-14 2021-05-14 Weakly supervised video behavior localization method based on a proxy metric model

Publications (2)

Publication Number Publication Date
CN113420592A CN113420592A (en) 2021-09-21
CN113420592B true CN113420592B (en) 2022-11-18

Family

ID=77712323

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110527929.XA Active CN113420592B (en) 2021-05-14 2021-05-14 Weakly supervised video behavior localization method based on a proxy metric model

Country Status (1)

Country Link
CN (1) CN113420592B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111797771A (en) * 2020-07-07 2020-10-20 南京理工大学 Method and system for detecting weak supervision video behaviors based on iterative learning
CN111914644A (en) * 2020-06-30 2020-11-10 西安交通大学 Dual-mode cooperation based weak supervision time sequence action positioning method and system
CN111914778A (en) * 2020-08-07 2020-11-10 重庆大学 Video behavior positioning method based on weak supervised learning

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111914644A (en) * 2020-06-30 2020-11-10 西安交通大学 Dual-mode cooperation based weak supervision time sequence action positioning method and system
CN111797771A (en) * 2020-07-07 2020-10-20 南京理工大学 Method and system for detecting weak supervision video behaviors based on iterative learning
CN111914778A (en) * 2020-08-07 2020-11-10 重庆大学 Video behavior positioning method based on weak supervised learning

Also Published As

Publication number Publication date
CN113420592A (en) 2021-09-21

Similar Documents

Publication Publication Date Title
Wang et al. Relaxed multiple-instance SVM with application to object discovery
CN110569793B (en) Target tracking method for unsupervised similarity discrimination learning
Jing et al. Videossl: Semi-supervised learning for video classification
Lai et al. Video event detection by inferring temporal instance labels
CN109241829B (en) Behavior identification method and device based on space-time attention convolutional neural network
CN107145862B (en) Multi-feature matching multi-target tracking method based on Hough forest
CN108537119B (en) Small sample video identification method
CN111553127A (en) Multi-label text data feature selection method and device
JP2006172437A (en) Method for determining position of segment boundary in data stream, method for determining segment boundary by comparing data subset with vicinal data subset, program of instruction executable by computer, and system or device for identifying boundary and non-boundary in data stream
CN114930352A (en) Method for training image classification model
CN104966105A (en) Robust machine error retrieving method and system
Ge et al. Fine-grained bird species recognition via hierarchical subset learning
CN112381248A (en) Power distribution network fault diagnosis method based on deep feature clustering and LSTM
CN111598004A (en) Progressive-enhancement self-learning unsupervised cross-domain pedestrian re-identification method
CN112507778B (en) Loop detection method of improved bag-of-words model based on line characteristics
CN109086794B (en) Driving behavior pattern recognition method based on T-LDA topic model
CN107220597B (en) Key frame selection method based on local features and bag-of-words model human body action recognition process
Vainstein et al. Modeling video activity with dynamic phrases and its application to action recognition in tennis videos
Nemade et al. Image segmentation using convolutional neural network for image annotation
CN106056146B (en) The visual tracking method that logic-based returns
Ma et al. Event detection in soccer video based on self-attention
CN113420592B (en) Agent measurement model-based weak surveillance video behavior positioning method
CN113128410A (en) Weak supervision pedestrian re-identification method based on track association learning
CN106326927B (en) A kind of shoes print new category detection method
Artan et al. Combining multiple 2ν-SVM classifiers for tissue segmentation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant