CN114494941A - Comparison learning-based weak supervision time sequence action positioning method - Google Patents

Comparison learning-based weak supervision time sequence action positioning method

Info

Publication number
CN114494941A
CN114494941A CN202111610682.4A
Authority
CN
China
Prior art keywords
action
act
video
network
amb
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111610682.4A
Other languages
Chinese (zh)
Inventor
侯永宏
李岳阳
张浩元
张文静
刘传玉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN202111610682.4A priority Critical patent/CN114494941A/en
Publication of CN114494941A publication Critical patent/CN114494941A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2155Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the incorporation of unlabelled data, e.g. multiple instance learning [MIL], semi-supervised techniques using expectation-maximisation [EM] or naïve labelling
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques

Abstract

The invention discloses a weakly supervised temporal action localization method based on contrastive learning, which localizes actions of interest in un-trimmed videos under the supervision of video-level action category labels only. First, a pre-trained feature extraction network extracts video features from the RGB data and optical flow data of the original video, and the video features are fed into a subsequent action localization network. The action localization network comprises two branches: one branch maps the video features to an original temporal class activation sequence (T-CAS); the other branch is a multi-branch attention model that models the salient action segments, background segments, and fuzzy action segments in the video respectively and generates three corresponding temporal class activation sequences, so that the network acquires the ability to separate action features from background features through a multiple-instance learning (MIL) mechanism. The invention can perceive accurate action temporal boundaries in un-trimmed videos, avoids truncating complete actions, and greatly improves the action localization accuracy.

Description

Comparison learning-based weak supervision time sequence action positioning method
Technical Field
The invention belongs to the fields of computer vision and deep learning, relates to video localization technology, and in particular relates to a weakly supervised temporal action localization method based on contrastive learning.
Background
In recent years, with the development of deep learning, the field of video understanding has made significant breakthroughs. Temporal action localization, a research hotspot in video understanding, has great application potential in many real-world scenarios such as video surveillance, anomaly detection, and video retrieval. Its main task is to precisely localize the start and end times of actions of interest in long un-trimmed videos and to classify the actions correctly. At present, temporal action localization is mostly trained in a fully supervised manner, for which the key is to collect enough un-trimmed videos annotated frame by frame. In the real world, however, annotating massive video data frame by frame requires a large amount of manpower and material resources; in addition, because actions are abstract, manually annotated temporal action labels are easily influenced by subjective factors, which introduces annotation errors. Temporal action localization based on weakly supervised learning has therefore been derived, which uses only video-level action category labels as the supervision during network training. Compared with precise temporal action labels, action category labels are easier to obtain, and the bias caused by manual annotation can be effectively avoided.
Existing weakly supervised temporal action localization methods can be divided into two types. The first, inspired by semantic segmentation techniques, maps weakly supervised temporal action localization to an action classification problem, introduces an action-background separation mechanism to construct video-level features, and finally recognizes the video with an action classifier. The second formulates temporal action localization as a multiple-instance learning task: the whole un-trimmed video is regarded as a multiple-instance bag containing both positive and negative samples, which correspond respectively to the action segments and background segments of the video; a temporal class activation sequence is obtained through a classifier to describe the probability distribution of actions over time, top-k pooling is adopted to aggregate video-level class scores, and finally a threshold is applied to the temporal class activation sequence to localize the actions.
Both types of methods solve the localization problem in un-trimmed videos by learning an effective classification loss. Although they achieve certain results, like most weakly supervised learning methods they lack temporal labels, so it is difficult for the network to model the complete course of an action: the most salient parts of an action receive excessive attention, while secondary regions with less distinctive features are ignored. Furthermore, because the videos are not manually trimmed, a complete action often contains ambiguous frames such as shot transitions and slow motion. These frames are semantically related to the action and are part of it, but their action features are not distinctive, so the activation values at these time positions are low; they are difficult to distinguish from salient background segments with equally low activation values and are falsely detected as background frames. Therefore, discovering and refining the fuzzy action features in the video so that the network captures more complete action segments is of great significance for improving weakly supervised temporal action localization performance.
Disclosure of Invention
The purpose of the invention is to overcome the deficiencies of the prior art and to provide a weakly supervised temporal action localization method based on contrastive learning. The feature extraction network and the action localization network are trained separately; the salient actions, fuzzy actions, and salient background in the video are modeled respectively by a multi-branch attention model; and a fuzzy-action contrast loss function is introduced to refine the video features, so that the network perceives more accurate temporal boundaries and the action localization accuracy is effectively improved.
The invention adopts the following technical scheme to solve the above technical problems:
First, a pre-trained I3D network is used to extract the RGB features and optical flow features of the original video, which are concatenated to obtain the video features X. The video features X are fed into a feature embedding model built from temporal convolutions and mapped to the feature space of the weakly supervised temporal action localization task to learn more discriminative embedded features X_in, which can be expressed by the following formula:
X_in = ReLU(Conv(X, θ_emb))
where X_in ∈ R^{s×T}, s is the feature dimension, T is the time dimension, θ_emb are the trainable parameters of the feature embedding model, and ReLU is the activation function. Two branches are then designed in the action localization network: a classification branch and an attention branch.
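As a minimal illustration of the feature embedding step, the following PyTorch sketch assumes a 2048-dimensional concatenated I3D feature, a kernel size of 3, and the module name FeatureEmbedding; none of these specifics are fixed by the patent.

```python
import torch
import torch.nn as nn

class FeatureEmbedding(nn.Module):
    """Maps concatenated I3D RGB+flow features X (B, D, T) to embedded features X_in (B, s, T)."""
    def __init__(self, in_dim=2048, emb_dim=2048, kernel_size=3):
        super().__init__()
        # temporal convolution over the segment axis, as in X_in = ReLU(Conv(X, theta_emb))
        self.conv = nn.Conv1d(in_dim, emb_dim, kernel_size, padding=kernel_size // 2)
        self.relu = nn.ReLU()

    def forward(self, x):              # x: (batch, feature_dim, T)
        return self.relu(self.conv(x))

# usage: X has shape (batch, 2048, T) after concatenating the RGB and flow features
X = torch.randn(2, 2048, 750)
x_in = FeatureEmbedding()(X)           # (2, 2048, 750)
```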
In the classification branch, a classification model is built with temporal convolution, and the embedded features X_in are mapped to the action-category feature space to obtain the original temporal class activation sequence F ∈ R^{(c+1)×T}, which represents the probability distribution of actions over time; c is the number of action categories, and the (c+1)-th dimension corresponds to the background class. This process can be expressed as:
F = Conv(X_in, θ_cls)
where θ_cls are the trainable classification model parameters. To enable the network to separate salient background segments from salient action segments and to detect fuzzy action segments in the video, the invention designs an attention model with three branches based on temporal convolution to model the salient actions, the salient background, and the fuzzy actions respectively. The output of the model is the attention weights Att = [a_act; a_amb; a_bkd] ∈ R^{3×T}, where a_act, a_amb, and a_bkd correspond to the probability distributions over time of the salient actions, the fuzzy actions, and the salient background, respectively. The specific process is:
Att = Softmax(Conv(X_in, θ_att))
where θ_att are the trainable attention model parameters. To distinguish the salient actions, fuzzy actions, and salient background in the video features, the corresponding temporal class activation sequences CAS_act, CAS_amb, and CAS_bkd are constructed from the three attention weights and the original temporal class activation sequence F. For the salient actions this can be formulated as:
CAS_act = a_act * F
and, similarly, the sequences describing the fuzzy actions and the salient background are obtained as CAS_amb = a_amb * F and CAS_bkd = a_bkd * F.
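The classification branch and the three-branch attention model can be sketched as follows; the 1×1 kernel size, the softmax taken over the three branches at each time step, and the class count are illustrative assumptions rather than values given in the patent.

```python
import torch
import torch.nn as nn

class LocalizationHeads(nn.Module):
    """Classification branch (T-CAS) and three-branch attention, following the description above."""
    def __init__(self, emb_dim=2048, num_classes=20):
        super().__init__()
        # F = Conv(X_in, theta_cls): (c+1) outputs, the last one is the background class
        self.cls_conv = nn.Conv1d(emb_dim, num_classes + 1, kernel_size=1)
        # Att = Softmax(Conv(X_in, theta_att)): one weight per branch (act / amb / bkd)
        self.att_conv = nn.Conv1d(emb_dim, 3, kernel_size=1)

    def forward(self, x_in):                                   # x_in: (B, s, T)
        cas = self.cls_conv(x_in)                              # original T-CAS F: (B, c+1, T)
        att = torch.softmax(self.att_conv(x_in), dim=1)        # (B, 3, T), normalized over branches
        a_act, a_amb, a_bkd = att[:, 0:1], att[:, 1:2], att[:, 2:3]
        # branch-specific class activation sequences: CAS_x = a_x * F
        cas_act, cas_amb, cas_bkd = a_act * cas, a_amb * cas, a_bkd * cas
        return cas, cas_act, cas_amb, cas_bkd, (a_act, a_amb, a_bkd)
```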
To evaluate the loss of each temporal class activation sequence, the invention aggregates the class activation values of the video segments by top-k pooling to obtain video-level action class scores. Taking F as an example, this can be formulated as:
s_j = (1/k) Σ_{i∈ℓ} F(j, i)
where ℓ ⊂ {1, 2, ..., T} is the set of the k time indices with the largest activation values for class j, |ℓ| = k = max(1, T//r), and r is a preset parameter. Finally, a Softmax function is applied along the category dimension to obtain the video-level action class scores, and the classification loss is computed with a cross-entropy function:
p_j = exp(s_j) / Σ_{j'=1}^{c+1} exp(s_{j'})
L_cls = - Σ_{j=1}^{c+1} y_j log(p_j)
where j = 1, 2, ..., c+1, p_j is the probability that the video contains action j, y_j is the video-level label, and L_cls is the classification loss function of the original temporal class activation sequence. Similarly, the corresponding classification loss functions L_cls^act, L_cls^amb, and L_cls^bkd can be obtained from the temporal class activation sequences CAS_act, CAS_amb, and CAS_bkd.
The loss function of the salient-action temporal class activation sequence CAS_act is:
L_cls^act = - Σ_{j=1}^{c+1} y_j^act log(p_j^act)
where p_j^act is the video-level class score aggregated from CAS_act by top-k pooling, k_act = max(1, T//r_act), and r_act is a preset parameter.
The loss function of the fuzzy-action temporal class activation sequence CAS_amb is:
L_cls^amb = - Σ_{j=1}^{c+1} y_j^amb log(p_j^amb)
where p_j^amb is the video-level class score aggregated from CAS_amb by top-k pooling, k′_amb = max(1, T//r′_amb), and r′_amb is a preset parameter.
The loss function of the salient-background temporal class activation sequence CAS_bkd is:
L_cls^bkd = - Σ_{j=1}^{c+1} y_j^bkd log(p_j^bkd)
where p_j^bkd is the video-level class score aggregated from CAS_bkd by top-k pooling, k_bkd = max(1, T//r_bkd), and r_bkd is a preset parameter.
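The top-k aggregation and cross-entropy losses above can be sketched as a single helper applied to each of the four class activation sequences; the label normalization and the preset parameter value are assumptions, and the per-branch label vectors follow the assignment given in the detailed description below.

```python
import torch
import torch.nn.functional as F_torch

def video_level_loss(cas, y, r=8):
    """cas: (B, c+1, T) class activation sequence; y: (B, c+1) video-level label (incl. background dim)."""
    B, C, T = cas.shape
    k = max(1, T // r)                              # k = max(1, T // r)
    s = cas.topk(k, dim=2).values.mean(dim=2)       # s_j: mean of the top-k activations per class
    p = F_torch.softmax(s, dim=1)                   # video-level class probabilities p_j
    y_norm = y / y.sum(dim=1, keepdim=True)         # normalize the multi-hot labels (assumption)
    return -(y_norm * torch.log(p + 1e-8)).sum(dim=1).mean()

# the four classification losses reuse this routine with different labels / preset r values:
#   L_cls      : y = [ground-truth actions, background = 1]
#   L_cls_act  : y = [ground-truth actions, background = 0]
#   L_cls_amb  : y = [ground-truth actions, background = 1]
#   L_cls_bkd  : y = [zeros,                background = 1]
```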
The above process alone makes it difficult to directly localize fuzzy action segments in complex un-trimmed videos. Therefore, the invention designs a fuzzy-action contrast loss function to refine the video features. First, according to the salient-action attention a_act, top-k pooling is applied to the embedded features X_in to capture the salient-action features X_act:
X_act = { X_in(l) | l ∈ topk(k_act, a_act) }
where k_act = max(1, T//r_act) is a hyperparameter, r_act is a preset parameter controlling the sampling rate of the salient-action features, and topk(k, x) returns the time indices of the k largest values of x. In the same way, the salient-background features X_bkd can be obtained:
X_bkd = { X_in(l) | l ∈ topk(k_bkd, a_bkd) }
whose parameters are similar to those of X_act. Because the attention weight a_amb attends to both the salient actions and the fuzzy actions, the fuzzy-action features are difficult to acquire directly, while the salient-action weights are slightly larger than the fuzzy-action weights. Therefore, the time indices corresponding to the salient-action features and the salient-background features are first removed from a_amb, which is formulated as:
a′_amb(l) = a_amb(l) if l ∉ topk(k_act, a_act) ∪ topk(k_bkd, a_bkd), otherwise a′_amb(l) = 0.
Then the same top-k pooling is used to obtain the fuzzy-action features X_amb:
X_amb = { X_in(l) | l ∈ topk(k_amb, a′_amb) }
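A sketch of the feature selection described above; treating the removal of indices from a_amb as zeroing them out, and the per-video shapes and preset sampling rates, are assumptions.

```python
import torch

def sample_contrast_features(x_in, a_act, a_amb, a_bkd, r_act=8, r_bkd=8, r_amb=8):
    """x_in: (s, T) embedded features; a_*: (T,) attention weights for one video."""
    T = x_in.shape[1]
    k_act = max(1, T // r_act)
    k_bkd = max(1, T // r_bkd)
    k_amb = max(1, T // r_amb)

    idx_act = a_act.topk(k_act).indices            # topk(k_act, a_act): salient-action time indices
    idx_bkd = a_bkd.topk(k_bkd).indices            # salient-background time indices
    x_act = x_in[:, idx_act]                       # X_act: (s, k_act)
    x_bkd = x_in[:, idx_bkd]                       # X_bkd: (s, k_bkd)

    # a'_amb: suppress the time indices already taken by salient action / background
    a_amb_masked = a_amb.clone()
    a_amb_masked[idx_act] = 0.0
    a_amb_masked[idx_bkd] = 0.0
    idx_amb = a_amb_masked.topk(k_amb).indices     # remaining high-attention positions -> fuzzy action
    x_amb = x_in[:, idx_amb]                       # X_amb: (s, k_amb)
    return x_act, x_amb, x_bkd
```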
The parameters of X_amb are similar to those of X_act. Finally, an InfoNCE loss function is applied at the video-segment level to compute the fuzzy-action contrast loss and refine the fuzzy-action features. Given fuzzy-action features x_amb ~ X_amb, salient-action features x_act ~ X_act, and salient-background features x_bkd ~ X_bkd, the InfoNCE loss is introduced:
L_con = - E_{x_amb, x_act} [ log( exp(sim(x_amb, x_act)/τ) / ( exp(sim(x_amb, x_act)/τ) + Σ_{x_bkd ∈ X_bkd} exp(sim(x_amb, x_bkd)/τ) ) ) ]
where sim(·,·) denotes feature similarity, τ = 0.07 is the temperature constant, topk(k, x) returns the time indices of the k largest values of x, k_amb = max(1, T//r_amb) with r_amb a preset parameter controlling the sampling rate of the fuzzy-action features, and k_bkd is a hyperparameter controlling the sampling rate of the salient-background features X_bkd. In addition to the above loss functions, an L1 loss is introduced to ensure the sparsity of the salient-action attention weights a_act:
L_att = (1/T) ||a_act||_1
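One way to realize the fuzzy-action contrast loss with a standard InfoNCE formulation is sketched below; the cosine similarity, the averaging over fuzzy-action features, and the multi-positive treatment are assumptions, since the patent only names InfoNCE with the stated positive and negative pairs.

```python
import torch
import torch.nn.functional as F_torch

def fuzzy_action_contrast_loss(x_act, x_amb, x_bkd, tau=0.07):
    """x_act: (s, k_act), x_amb: (s, k_amb), x_bkd: (s, k_bkd) column features for one video."""
    # cosine similarities between every fuzzy-action feature and the salient features
    x_act, x_amb, x_bkd = (F_torch.normalize(x, dim=0) for x in (x_act, x_amb, x_bkd))
    sim_pos = (x_amb.t() @ x_act) / tau            # (k_amb, k_act): positive pairs (fuzzy vs. action)
    sim_neg = (x_amb.t() @ x_bkd) / tau            # (k_amb, k_bkd): negative pairs (fuzzy vs. background)
    # InfoNCE: pull fuzzy action toward salient action, push it away from salient background
    logits = torch.cat([sim_pos, sim_neg], dim=1)  # (k_amb, k_act + k_bkd)
    log_prob = F_torch.log_softmax(logits, dim=1)
    return -log_prob[:, : sim_pos.shape[1]].mean()

def sparsity_loss(a_act):
    """L1 sparsity on the salient-action attention: L_att = ||a_act||_1 / T."""
    return a_act.abs().mean()
```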
Finally, all the loss functions are combined to compute the total loss function L_total, and the network is trained to convergence by optimization:
L_total = L_cls + L_cls^act + L_cls^amb + L_cls^bkd + α L_con + β L_att
where α and β are the corresponding loss coefficients.
In the testing phase, CAS_act models the action contours more accurately, so the video-level class scores p_act are obtained from CAS_act, a threshold θ_cls is set, and the action categories c_act whose scores in p_act exceed θ_cls are selected. A multi-threshold segmentation strategy is then applied to the dimension of CAS_act corresponding to category c_act to obtain a large number of action nominations. For an action nomination (t_s, t_e, c_act), the confidence score is computed by contrasting the class activation values of CAS_act inside the nomination with those in the regions adjacent to its boundaries, where t_s and t_e are the start and end times of the action, l_i = t_e - t_s, and μ is a preset parameter. Finally, a non-maximum suppression algorithm is used to remove redundant nominations and obtain the final action localization result.
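The test-time procedure can be sketched as follows; the threshold values, the outer-region size derived from μ, and the inner-minus-outer confidence score are illustrative assumptions consistent with, but not guaranteed to match, the patent's exact formulas.

```python
import numpy as np

def temporal_iou(p, q):
    """Temporal IoU of two proposals (t_s, t_e, ...)."""
    inter = max(0.0, min(p[1], q[1]) - max(p[0], q[0]))
    union = max(p[1], q[1]) - min(p[0], q[0])
    return inter / union if union > 0 else 0.0

def localize_actions(cas_act, p_act, theta_cls=0.2,
                     seg_thresholds=(0.1, 0.2, 0.3, 0.4, 0.5), mu=4, iou_thr=0.5):
    """cas_act: (c+1, T) salient-action CAS; p_act: (c+1,) video-level class scores."""
    proposals = []
    for c in np.where(p_act[:-1] > theta_cls)[0]:        # action classes scored above theta_cls
        track = cas_act[c]
        for th in seg_thresholds:                        # multi-threshold segmentation
            mask = track > th
            t = 0
            while t < len(mask):
                if mask[t]:
                    s = t
                    while t < len(mask) and mask[t]:
                        t += 1
                    e = t                                # candidate nomination [s, e)
                    margin = max(1, (e - s) // mu)       # outer extent controlled by mu (assumed)
                    lo, hi = max(0, s - margin), min(len(track), e + margin)
                    outer = np.concatenate([track[lo:s], track[e:hi]])
                    outer_mean = outer.mean() if outer.size else 0.0
                    proposals.append((s, e, int(c), track[s:e].mean() - outer_mean))
                t += 1
    # non-maximum suppression: keep the best-scoring nomination among overlapping same-class ones
    proposals.sort(key=lambda p: p[3], reverse=True)
    kept = []
    for p in proposals:
        if all(p[2] != q[2] or temporal_iou(p, q) < iou_thr for q in kept):
            kept.append(p)
    return kept
```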
The invention has the following advantages and beneficial effects:
1. The invention provides a weakly supervised temporal action localization method based on contrastive learning. During training, only video-level action category labels are used as supervision, and no manually annotated temporal action labels are required, which greatly reduces the consumption of manpower and material resources.
2. The invention models the salient actions, fuzzy actions, and salient background in the video respectively through the multi-branch attention model, effectively separates the action features from the background features in the video, and significantly improves the action localization accuracy on different datasets.
3. The invention designs the fuzzy-action contrast loss function, which refines the video features under the guidance of the salient features, enables the network to perceive more accurate temporal boundaries, prevents the action localization results from being truncated, and effectively improves the action localization accuracy.
4. The invention obtains better results than current mainstream action localization models without introducing a recurrent neural network into the action localization network, which avoids the vanishing-gradient problem that recurrent neural networks are prone to, reduces the computational cost of the network, and speeds up network training.
Drawings
Fig. 1 is the network structure of the weakly supervised temporal action localization method based on contrastive learning according to the present invention.
Fig. 2 shows visualization results according to an embodiment of the invention.
Detailed Description
The present invention will be described in further detail below with reference to specific embodiments, which are illustrative only and not limiting; the protection scope of the present invention is not limited thereby.
The invention relates to a weakly supervised temporal action localization method based on contrastive learning. It adopts staged training of the feature extraction network and the action localization network, models the salient actions, fuzzy actions, and salient background in the video with a multi-branch attention model to effectively separate the action features from the background features, and introduces a fuzzy-action contrast loss function that refines the video features under the guidance of the salient features, so that the network perceives more accurate temporal boundaries, truncated localization results are avoided, and the action localization accuracy is effectively improved.
Fig. 1 shows the network structure of the weakly supervised temporal action localization method based on contrastive learning according to the present invention.
The overall framework of the invention mainly comprises two networks: a feature extraction network and an action localization network.
The feature extraction network adopts an I3D network pre-trained on the Kinetics dataset as its backbone. The network is built on a 3D Inception model, and four pooling layers with a temporal stride of 2 are inserted to control the number of network parameters, so that temporal features can be fused, the size of the receptive field can be controlled reasonably, and the loss of detail information is prevented. The action localization network consists of a feature embedding model, a classification model, and a multi-branch attention model, all of which are built with temporal convolutional networks to better capture the temporal features of the video.
The datasets adopted by the invention are the THUMOS-14 dataset and the ActivityNet-1.2 dataset. The THUMOS-14 dataset contains 20 action classes, each video contains 15.4 action segments on average, and all data are obtained from the YouTube website. Video lengths vary from tens of seconds to tens of minutes, making it a challenging dataset for the weakly supervised temporal action localization task. Following the data partition used by previous mainstream algorithms, the invention adopts the 200 validation videos with temporal annotations as the training set and the 213 test videos as the test set. ActivityNet-1.2 is a large-scale temporal action localization dataset containing 100 action classes; the training set includes 4819 videos and the test set includes 2382 videos. On average each video contains 1.5 action segments and 36% background, and the proportion of action segments is significantly lower than in the THUMOS-14 dataset.
First, every 16 consecutive frames of the un-trimmed video are grouped into one video segment, giving T video segments, which are fed into the pre-trained feature extraction network to extract RGB features and optical flow features; these are concatenated to obtain the video features X. The feature extraction network does not participate in the subsequent weakly supervised training. The video features X are then fed into a feature embedding model built from temporal convolutions and mapped to the feature space of the weakly supervised temporal action localization task to learn more discriminative embedded features X_in ∈ R^{s×T}, where s is the feature dimension and T is the time dimension.
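The snippet-level feature extraction can be sketched as below; the i3d_rgb and i3d_flow callables stand in for the frozen pre-trained I3D streams, and the 1024-dimensional outputs and input layout are assumptions.

```python
import torch

@torch.no_grad()  # the feature extractor is frozen and does not join the weakly supervised training
def extract_video_features(frames, flows, i3d_rgb, i3d_flow, snippet_len=16):
    """frames/flows: (num_frames, C, H, W) tensors; returns X with shape (feature_dim, T)."""
    T = frames.shape[0] // snippet_len               # number of 16-frame video segments
    feats = []
    for t in range(T):
        clip = slice(t * snippet_len, (t + 1) * snippet_len)
        rgb_feat = i3d_rgb(frames[clip].unsqueeze(0))    # assumed to return (1, 1024)
        flow_feat = i3d_flow(flows[clip].unsqueeze(0))   # assumed to return (1, 1024)
        feats.append(torch.cat([rgb_feat, flow_feat], dim=1))
    return torch.cat(feats, dim=0).t()               # X: (2048, T), ready for the embedding model
```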
Next, the embedded features X_in are used to obtain the temporal class activation sequences for localizing actions. To this end, the invention designs two branches in the action localization network: a classification branch and an attention branch.
In the classification branch, a classification model is built with temporal convolution, and the embedded features X_in are mapped to the action-category feature space to obtain the original temporal class activation sequence F ∈ R^{(c+1)×T}, which represents the probability distribution of actions over time; c is the number of action categories, and the (c+1)-th dimension corresponds to the background class.
However, with only the original temporal class activation sequence F, it is difficult for the network to separate the salient actions from the salient background in the video. To enable the network to separate salient background segments from salient action segments and to detect fuzzy action segments, an attention model with three branches is designed based on temporal convolution to model the salient actions, the salient background, and the fuzzy actions respectively; the attention weights of the three branches are obtained, and a Softmax function is adopted to normalize the output of the attention model. The output of the model is the attention weights Att = [a_act; a_amb; a_bkd] ∈ R^{3×T}, where a_act, a_amb, and a_bkd correspond to the probability distributions over time of the salient actions, the fuzzy actions, and the salient background, respectively. The specific process is:
Att = Softmax(Conv(X_in, θ_att))
where θ_att are the trainable attention model parameters. To distinguish the salient actions, fuzzy actions, and salient background in the video features, the corresponding temporal class activation sequences CAS_act, CAS_amb, and CAS_bkd are constructed from the three attention weights and the original temporal class activation sequence F. For the salient actions this can be formulated as:
CAS_act = a_act * F
and, similarly, the sequences describing the fuzzy actions and the salient background are obtained as CAS_amb = a_amb * F and CAS_bkd = a_bkd * F.
CAS_act has higher activation values at the time positions of the salient actions in the video and is suppressed at the time positions of the salient background, while CAS_bkd has higher activation values at the time positions of the salient background. Thus, based on CAS_act and CAS_bkd, the network can separate the salient actions from the salient background in the video. CAS_amb has higher activation values at the time positions of both the salient actions and the fuzzy actions.
The invention aggregates the salient features in the video through a multiple-instance learning mechanism to supervise the training process of the network. The whole un-trimmed video is regarded as a multiple-instance bag, each video segment is treated as an instance, and each video segment obtains its corresponding class activation value by the method above. To evaluate the loss of each temporal class activation sequence, the invention aggregates the class activation values of the video segments by top-k pooling to obtain video-level action class scores. Taking F as an example, this can be formulated as:
s_j = (1/k) Σ_{i∈ℓ} F(j, i)
where ℓ ⊂ {1, 2, ..., T} is the set of the k time indices with the largest activation values for class j, |ℓ| = k = max(1, T//r), and r is a preset parameter. Finally, a Softmax function is applied along the category dimension to obtain the video-level action class scores, and the classification loss is computed with a cross-entropy function:
p_j = exp(s_j) / Σ_{j'=1}^{c+1} exp(s_{j'})
L_cls = - Σ_{j=1}^{c+1} y_j log(p_j)
where j = 1, 2, ..., c+1, p_j is the probability that the video contains action j, and L_cls is the classification loss function of the original temporal class activation sequence. Similarly, the corresponding classification loss functions L_cls^act, L_cls^amb, and L_cls^bkd can be obtained from the temporal class activation sequences CAS_act, CAS_amb, and CAS_bkd.
the invention regards a whole un-clipped video as a multi-example packet containing actions and backgrounds at the same time, and the category label of the original time domain class activation sequence is set as yj=1,y c+11. Second, to guarantee CASactAnd CASbkdCorresponding attention aactAnd abkdRespectively paying attention to the salient motion and the salient background in the video, and respectively setting the category labels of the salient motion and the salient background as yj=1,y c+10 and yj=0,y c+11. In addition, to locate blurred motion in video, the present inventionThe invention sets the CASambClass label of yj=1,y c+11, let aambThe method can focus on the remarkable action with a high activation value and the fuzzy action with a relatively low activation value in the video at the same time.
Although the above process can separate actions from background using the multi-branch attention model, the network lacks the guidance of action temporal-scale information, it is difficult to directly localize fuzzy action segments in complex un-trimmed videos, and the completeness of the localization results cannot be guaranteed. A fuzzy action segment, however, tends to be temporally adjacent to a salient action segment and far from the salient background segments; moreover, its attention weight is slightly lower than the salient-action attention weight but significantly larger than the salient-background attention weight. Based on this idea, the invention provides a simple and effective method for localizing the fuzzy action segments in the video and designs a fuzzy-action contrast loss function to refine the video features, so that the network can localize more complete actions. First, according to the salient-action attention a_act, top-k pooling is applied to the embedded features X_in to capture the salient-action features X_act:
X_act = { X_in(l) | l ∈ topk(k_act, a_act) }
where k_act = max(1, T//r_act) is a hyperparameter, r_act is a preset parameter controlling the sampling rate of the salient-action features, and topk(k, x) returns the time indices of the k largest values of x. In the same way, the salient-background features X_bkd can be obtained:
X_bkd = { X_in(l) | l ∈ topk(k_bkd, a_bkd) }
whose parameters are similar to those of X_act. Because the attention weight a_amb attends to both the salient actions and the fuzzy actions, the fuzzy-action features are difficult to acquire directly, while the salient-action weights are slightly larger than the fuzzy-action weights. Therefore, the time indices corresponding to the salient-action features and the salient-background features are first removed from a_amb, which is formulated as:
a′_amb(l) = a_amb(l) if l ∉ topk(k_act, a_act) ∪ topk(k_bkd, a_bkd), otherwise a′_amb(l) = 0.
Then the same top-k pooling is used to obtain the fuzzy-action features X_amb:
X_amb = { X_in(l) | l ∈ topk(k_amb, a′_amb) }
The parameters of X_amb are similar to those of X_act. Finally, an InfoNCE loss function is applied at the video-segment level to compute the fuzzy-action contrast loss and refine the fuzzy-action features. Positive sample pairs are constructed from the salient-action features and the fuzzy-action features, and negative sample pairs are constructed from the salient-background features and the fuzzy-action features, which drives the salient actions and the fuzzy actions closer together in the feature space while pushing the salient background and the fuzzy actions apart. Given fuzzy-action features x_amb ~ X_amb, salient-action features x_act ~ X_act, and salient-background features x_bkd ~ X_bkd, the InfoNCE loss is introduced:
L_con = - E_{x_amb, x_act} [ log( exp(sim(x_amb, x_act)/τ) / ( exp(sim(x_amb, x_act)/τ) + Σ_{x_bkd ∈ X_bkd} exp(sim(x_amb, x_bkd)/τ) ) ) ]
where sim(·,·) denotes feature similarity, k_bkd is a hyperparameter controlling the sampling rate of the salient-background features X_bkd, and τ = 0.07 is the temperature constant. This loss maximizes the mutual information between the salient action segments and the fuzzy action segments. Therefore, during each round of iterative training the network keeps discovering new fuzzy-action features and contrasting them with the salient features, so that the feature information within the true action extent becomes richer, the discriminability of the feature distribution improves, and the full course of a complete action is captured. In addition to the above loss functions, an L1 loss is introduced to ensure the sparsity of the salient-action attention weights a_act:
L_att = (1/T) ||a_act||_1
Finally, all the loss functions are combined to compute the total loss function L_total, and the network is trained to convergence with an Adam optimizer:
L_total = L_cls + L_cls^act + L_cls^amb + L_cls^bkd + α L_con + β L_att
where α and β are the corresponding loss coefficients.
In the testing phase, CAS_act models the action contours more accurately, so the video-level class scores p_act are obtained from CAS_act, a threshold θ_cls is set, and the action categories c_act whose scores in p_act exceed θ_cls are selected. A multi-threshold segmentation strategy is then applied to the dimension of CAS_act corresponding to category c_act to obtain a large number of action nominations. For an action nomination (t_s, t_e, c_act), the confidence score is computed by contrasting the class activation values of CAS_act inside the nomination with those in the regions adjacent to its boundaries, where t_s and t_e are the start and end times of the action, l_i = t_e - t_s, and μ is a preset parameter. Finally, a non-maximum suppression algorithm is used to remove redundant nominations and obtain the final action localization result.
The experiments are implemented with the PyTorch deep learning framework; the specific parameters are shown in Table 1 below:
TABLE 1 (parameter settings)
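Schematically, one training update with the Adam optimizer could look like the following; the learning rate, the unit loss weights, and the way the individual loss terms are supplied are placeholders rather than values from Table 1.

```python
import torch

def optimize_step(optimizer, l_cls, l_cls_act, l_cls_amb, l_cls_bkd, l_con, l_att,
                  alpha=1.0, beta=1.0):
    """Combine the loss terms into L_total and apply one Adam update."""
    l_total = l_cls + l_cls_act + l_cls_amb + l_cls_bkd + alpha * l_con + beta * l_att
    optimizer.zero_grad()
    l_total.backward()
    optimizer.step()
    return l_total.item()

# usage (hypothetical): hand the action localization network's parameters to Adam once, before training
# optimizer = torch.optim.Adam(action_localization_net.parameters(), lr=1e-4)
```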
The model was trained to convergence and evaluated on the THUMOS-14 dataset and the ActivityNet-1.2 dataset. The evaluation results are shown in Tables 2 and 3, respectively, from which it can be seen that the action localization accuracy of the method exceeds the previous mainstream methods on both datasets.
TABLE 2 (localization results on the THUMOS-14 dataset)
TABLE 3 (localization results on the ActivityNet-1.2 dataset)
FIG. 2 compares the visualization results of the method of the present invention with the previous best method, HAM-Net. (a) The actions correspond to an athlete's weightlifting process. In the two stages of picking the barbell up from the ground (frame [1]) and lifting it overhead (frame [4]), the motion amplitude is large and the action features are distinctive, and there is an obvious scene cut in the background [5]; these can easily be localized by the baseline method. However, in the middle of the lift the athlete raises the barbell and pauses at the chest position (frame [2]), and there is an obvious shot change during this stage (frame [3]). Without temporal supervision this stage is difficult to capture, but the method of the invention localizes it completely. (b) The video contains several golf swings, and the third action segment is played entirely in slow motion; the baseline method detects only part of the swing. Players tend to pause and exert force when the club is at its highest and lowest points (frames [1][3][5]), and because of the slow motion the action features at these time positions are more ambiguous and difficult to distinguish from the static background. The localization results show that the method solves this problem: the third action is localized completely without affecting the localization of the other actions, which fully demonstrates the effectiveness of the method.
The above description covers only preferred embodiments of the present invention, but the protection scope of the present invention is not limited thereto; any substitution or change that a person skilled in the art makes to the technical solution of the present invention and its inventive concept falls within the scope of the present invention.

Claims (8)

1. A weakly supervised temporal action localization method based on contrastive learning, characterized by comprising the following steps:
1) constructing a feature extraction network and an action localization network, wherein the action localization network comprises two branches corresponding respectively to a classification model and a multi-branch attention model;
2) constructing a staged weak-supervision training method in which the network learns only under the supervision of video-level action category labels: the original video sequence is processed, the RGB data and optical flow data are fed respectively into the pre-trained feature extraction network to extract features, and the features are concatenated to obtain the video features X; the video features X are then fed into a feature embedding model and mapped to the feature space of the weakly supervised temporal action localization task to obtain the embedded features X_in;
3) inputting the embedded features X_in into the classification model to obtain the original temporal class activation sequence F;
4) inputting the embedded features X_in into the multi-branch attention model to obtain the salient-action attention weight a_act, the fuzzy-action attention weight a_amb, and the salient-background attention weight a_bkd, and constructing the three corresponding temporal class activation sequences, namely the salient-action temporal class activation sequence CAS_act, the fuzzy-action temporal class activation sequence CAS_amb, and the salient-background temporal class activation sequence CAS_bkd, wherein the output of the multi-branch attention model is the normalized attention weights;
5) constructing positive and negative sample pairs according to the normalized attention weights, computing the fuzzy-action contrast loss function L_con, combining all the loss functions to compute the total loss function L_total, and training the network to convergence by optimization;
6) in the testing phase, applying threshold segmentation to the temporal class activation sequence CAS_act to obtain a large number of action nominations, and finally removing redundant nominations with a non-maximum suppression algorithm to obtain the final action localization result.
2. The weakly supervised temporal action localization method based on contrastive learning of claim 1, wherein: the feature extraction network in step 1) adopts an I3D network pre-trained on the Kinetics dataset, the I3D network does not participate in the subsequent weakly supervised training, and the classification model and the multi-branch attention model are built with temporal convolutional networks.
3. The weakly supervised temporal action localization method based on contrastive learning of claim 1, wherein: in step 2) the pre-trained feature extraction network is an I3D network, and the embedded features X_in are computed as:
X_in = ReLU(Conv(X, θ_emb))
where X_in ∈ R^{s×T}, s is the feature dimension, T is the time dimension, θ_emb are the trainable parameters of the feature embedding model, and ReLU is the activation function.
4. The weakly supervised temporal action localization method based on contrastive learning of claim 1, wherein: in step 3),
F = Conv(X_in, θ_cls)
where θ_cls are the trainable classification model parameters.
5. The weakly supervised temporal action localization method based on contrastive learning of claim 1, wherein: the normalized attention weights in step 4) are:
Att = Softmax(Conv(X_in, θ_att))
where θ_att are the trainable attention model parameters and Att = [a_act; a_amb; a_bkd] ∈ R^{3×T}.
6. The weakly supervised temporal action localization method based on contrastive learning of claim 1, wherein: the loss function of the original temporal class activation sequence F is:
L_cls = - Σ_{j=1}^{c+1} y_j log(p_j)
where p_j is the probability that the video contains action j, obtained by applying a Softmax function to the video-level class scores aggregated from F by top-k pooling over the time index set ℓ ⊂ {1, 2, ..., T} with |ℓ| = k = max(1, T//r), r is a preset parameter, and j = 1, 2, ..., c+1;
the loss function of the salient-action temporal class activation sequence CAS_act is:
L_cls^act = - Σ_{j=1}^{c+1} y_j^act log(p_j^act)
where p_j^act is the video-level class score aggregated from CAS_act by top-k pooling, k_act = max(1, T//r_act), and r_act is a preset parameter;
the loss function of the fuzzy-action temporal class activation sequence CAS_amb is:
L_cls^amb = - Σ_{j=1}^{c+1} y_j^amb log(p_j^amb)
where p_j^amb is the video-level class score aggregated from CAS_amb by top-k pooling, k′_amb = max(1, T//r′_amb), and r′_amb is a preset parameter;
the loss function of the salient-background temporal class activation sequence CAS_bkd is:
L_cls^bkd = - Σ_{j=1}^{c+1} y_j^bkd log(p_j^bkd)
where p_j^bkd is the video-level class score aggregated from CAS_bkd by top-k pooling, k_bkd = max(1, T//r_bkd), and r_bkd is a preset parameter.
7. The weakly supervised temporal action localization method based on contrastive learning of claim 1, wherein: the fuzzy-action contrast loss in step 5) is:
L_con = - E_{x_amb, x_act} [ log( exp(sim(x_amb, x_act)/τ) / ( exp(sim(x_amb, x_act)/τ) + Σ_{x_bkd ∈ X_bkd} exp(sim(x_amb, x_bkd)/τ) ) ) ]
where τ is the temperature constant, sim(·,·) denotes feature similarity, x_act ~ X_act, x_bkd ~ X_bkd, x_amb ~ X_amb, topk(k, x) returns the time indices of the k largest values of x, k_amb = max(1, T//r_amb), and r_amb is a preset parameter controlling the sampling rate of the fuzzy-action features.
8. The weakly supervised temporal action localization method based on contrastive learning of claim 1, wherein: the specific method of step 6) is: in the testing phase, the video-level class scores p_act are obtained from CAS_act, a threshold θ_cls is set, and the action categories c_act whose scores in p_act exceed θ_cls are selected; then a multi-threshold segmentation strategy is applied to the dimension of CAS_act corresponding to category c_act to obtain a large number of action nominations; for each action nomination (t_s, t_e, c_act), the confidence score is computed by contrasting the class activation values of CAS_act inside the nomination with those in the regions adjacent to its boundaries, where t_s and t_e are the start and end times of the action, l_i = t_e - t_s, and μ is a preset parameter; finally, a non-maximum suppression algorithm is adopted to remove redundant nominations to obtain the final action localization result.
CN202111610682.4A 2021-12-27 2021-12-27 Comparison learning-based weak supervision time sequence action positioning method Pending CN114494941A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111610682.4A CN114494941A (en) 2021-12-27 2021-12-27 Comparison learning-based weak supervision time sequence action positioning method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111610682.4A CN114494941A (en) 2021-12-27 2021-12-27 Comparison learning-based weak supervision time sequence action positioning method

Publications (1)

Publication Number Publication Date
CN114494941A true CN114494941A (en) 2022-05-13

Family

ID=81495834

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111610682.4A Pending CN114494941A (en) 2021-12-27 2021-12-27 Comparison learning-based weak supervision time sequence action positioning method

Country Status (1)

Country Link
CN (1) CN114494941A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116030538A (en) * 2023-03-30 2023-04-28 中国科学技术大学 Weak supervision action detection method, system, equipment and storage medium
CN116503959A (en) * 2023-06-30 2023-07-28 山东省人工智能研究院 Weak supervision time sequence action positioning method and system based on uncertainty perception
CN116503959B (en) * 2023-06-30 2023-09-08 山东省人工智能研究院 Weak supervision time sequence action positioning method and system based on uncertainty perception

Similar Documents

Publication Publication Date Title
CN108830252A (en) A kind of convolutional neural networks human motion recognition method of amalgamation of global space-time characteristic
CN109949317A (en) Based on the semi-supervised image instance dividing method for gradually fighting study
CN108600865B (en) A kind of video abstraction generating method based on super-pixel segmentation
CN114494941A (en) Comparison learning-based weak supervision time sequence action positioning method
CN106682108A (en) Video retrieval method based on multi-modal convolutional neural network
TWI712316B (en) Method and device for generating video summary
CN110348364B (en) Basketball video group behavior identification method combining unsupervised clustering and time-space domain depth network
CN106709453A (en) Sports video key posture extraction method based on deep learning
CN108491766B (en) End-to-end crowd counting method based on depth decision forest
CN106529477A (en) Video human behavior recognition method based on significant trajectory and time-space evolution information
CN109886165A (en) A kind of action video extraction and classification method based on moving object detection
CN112906631B (en) Dangerous driving behavior detection method and detection system based on video
CN108509939A (en) A kind of birds recognition methods based on deep learning
CN110210383A (en) A kind of basketball video Context event recognition methods of fusional movement mode and key visual information
CN111462162A (en) Foreground segmentation algorithm for specific class of pictures
CN114049581A (en) Weak supervision behavior positioning method and device based on action fragment sequencing
CN115410119A (en) Violent movement detection method and system based on adaptive generation of training samples
CN110968721A (en) Method and system for searching infringement of mass images and computer readable storage medium thereof
CN111191531A (en) Rapid pedestrian detection method and system
Zhao et al. Action recognition based on C3D network and adaptive keyframe extraction
CN105893967B (en) Human behavior classification detection method and system based on time sequence retention space-time characteristics
US20230290118A1 (en) Automatic classification method and system of teaching videos based on different presentation forms
CN108491751A (en) A kind of compound action recognition methods of the exploration privilege information based on simple action
Alwassel et al. Action search: Learning to search for human activities in untrimmed videos
CN116935303A (en) Weak supervision self-training video anomaly detection method

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination