CN114494941A - Comparison learning-based weak supervision time sequence action positioning method - Google Patents

Comparison learning-based weak supervision time sequence action positioning method

Info

Publication number
CN114494941A
CN114494941A CN202111610682.4A
Authority
CN
China
Prior art keywords
action
act
video
network
amb
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111610682.4A
Other languages
Chinese (zh)
Inventor
侯永宏
李岳阳
张浩元
张文静
刘传玉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN202111610682.4A priority Critical patent/CN114494941A/en
Publication of CN114494941A publication Critical patent/CN114494941A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2155Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the incorporation of unlabelled data, e.g. multiple instance learning [MIL], semi-supervised techniques using expectation-maximisation [EM] or naïve labelling
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques

Abstract

The invention discloses a weakly supervised temporal action localization method based on contrastive learning, which localizes actions of interest in un-trimmed videos under the supervision of video-level action category labels only. First, a pre-trained feature extraction network extracts video features from the RGB data and optical flow data of the original video, and the video features are fed into a subsequent action localization network. The action localization network comprises two branches: one branch maps the video features to an original temporal class activation sequence (T-CAS); the other branch is a multi-branch attention model that models the salient action segments, background segments, and fuzzy action segments in the video respectively and generates three corresponding temporal class activation sequences, so that the network acquires the ability to separate action features from background features through a multiple-instance learning (MIL) mechanism. The invention can perceive accurate action temporal boundaries in un-trimmed videos, avoids truncating complete actions, and greatly improves the action localization accuracy.

Description

Comparison learning-based weak supervision time sequence action positioning method
Technical Field
The invention belongs to the fields of computer vision and deep learning, relates to video localization technology, and in particular relates to a weakly supervised temporal action localization method based on contrastive learning.
Background
In recent years, with the development of deep learning, the field of video understanding has made significant breakthroughs. Temporal action localization, a research hotspot in video understanding, has great application potential in many real-world scenarios such as video surveillance, anomaly detection, and video retrieval. Its main task is to precisely localize the start and end times of actions of interest in long un-trimmed videos and to classify the actions correctly. At present, temporal action localization is mostly trained in a fully supervised manner, for which the key is to collect enough un-trimmed videos annotated frame by frame. In the real world, however, annotating massive video data frame by frame requires a large amount of manpower and material resources; in addition, because actions are abstract, manually annotated temporal action labels are easily influenced by subjective factors, which introduces annotation errors. Temporal action localization based on weakly supervised learning has therefore been derived, which uses only video-level action category labels as the supervision during network training. Compared with precise temporal action labels, action category labels are easier to obtain, and the bias caused by manual annotation can be effectively avoided.
Existing weakly supervised temporal action localization methods can be divided into two types. The first, inspired by semantic segmentation techniques, maps weakly supervised temporal action localization to an action classification problem, introduces an action-background separation mechanism to construct video-level features, and finally recognizes the video with an action classifier. The second formulates temporal action localization as a multiple-instance learning task: the whole un-trimmed video is regarded as a multiple-instance bag containing both positive and negative samples, which correspond respectively to the action segments and background segments of the video; a temporal class activation sequence is obtained through a classifier to describe the probability distribution of actions over time, top-k pooling is adopted to aggregate video-level class scores, and finally a threshold is applied to the temporal class activation sequence to localize the actions.
Both types of methods solve the localization problem in un-trimmed videos by learning an effective classification loss. Although they achieve certain results, like most weakly supervised learning methods they lack temporal labels, so it is difficult for the network to model the complete course of an action: the most salient parts of an action receive excessive attention, while secondary regions with less distinctive features are ignored. Furthermore, because the videos are not manually trimmed, a complete action often contains ambiguous frames such as shot transitions and slow motion. These frames are semantically related to the action and are part of it, but their action features are not distinctive, so the activation values at these time positions are low; they are difficult to distinguish from salient background segments with equally low activation values and are falsely detected as background frames. Therefore, discovering and refining the fuzzy action features in the video so that the network captures more complete action segments is of great significance for improving weakly supervised temporal action localization performance.
Disclosure of Invention
The purpose of the invention is to overcome the deficiencies of the prior art and to provide a weakly supervised temporal action localization method based on contrastive learning. The feature extraction network and the action localization network are trained separately; the salient actions, fuzzy actions, and salient background in the video are modeled respectively by a multi-branch attention model; and a fuzzy-action contrast loss function is introduced to refine the video features, so that the network perceives more accurate temporal boundaries and the action localization accuracy is effectively improved.
The invention adopts the following technical scheme to solve the above technical problems:
First, a pre-trained I3D network is used to extract the RGB features and optical flow features of the original video, which are concatenated to obtain the video features X. The video features X are fed into a feature embedding model built from temporal convolutions and mapped to the feature space of the weakly supervised temporal action localization task to learn more discriminative embedded features X_in, which can be expressed by the following formula:
X_in = ReLU(Conv(X, θ_emb))
where X_in ∈ R^{s×T}, s is the feature dimension, T is the time dimension, θ_emb are the trainable parameters of the feature embedding model, and ReLU is the activation function. Two branches are then designed in the action localization network: a classification branch and an attention branch.
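As a minimal illustration of the feature embedding step, the following PyTorch sketch assumes a 2048-dimensional concatenated I3D feature, a kernel size of 3, and the module name FeatureEmbedding; none of these specifics are fixed by the patent.

```python
import torch
import torch.nn as nn

class FeatureEmbedding(nn.Module):
    """Maps concatenated I3D RGB+flow features X (B, D, T) to embedded features X_in (B, s, T)."""
    def __init__(self, in_dim=2048, emb_dim=2048, kernel_size=3):
        super().__init__()
        # temporal convolution over the segment axis, as in X_in = ReLU(Conv(X, theta_emb))
        self.conv = nn.Conv1d(in_dim, emb_dim, kernel_size, padding=kernel_size // 2)
        self.relu = nn.ReLU()

    def forward(self, x):              # x: (batch, feature_dim, T)
        return self.relu(self.conv(x))

# usage: X has shape (batch, 2048, T) after concatenating the RGB and flow features
X = torch.randn(2, 2048, 750)
x_in = FeatureEmbedding()(X)           # (2, 2048, 750)
```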
In the classification branch, a classification model is built with temporal convolution, and the embedded features X_in are mapped to the action-category feature space to obtain the original temporal class activation sequence F ∈ R^{(c+1)×T}, which represents the probability distribution of actions over time; c is the number of action categories, and the (c+1)-th dimension corresponds to the background class. This process can be expressed as:
F = Conv(X_in, θ_cls)
where θ_cls are the trainable classification model parameters. To enable the network to separate salient background segments from salient action segments and to detect fuzzy action segments in the video, the invention designs an attention model with three branches based on temporal convolution to model the salient actions, the salient background, and the fuzzy actions respectively. The output of the model is the attention weights Att = [a_act; a_amb; a_bkd] ∈ R^{3×T}, where a_act, a_amb, and a_bkd correspond to the probability distributions over time of the salient actions, the fuzzy actions, and the salient background, respectively. The specific process is:
Att = Softmax(Conv(X_in, θ_att))
where θ_att are the trainable attention model parameters. To distinguish the salient actions, fuzzy actions, and salient background in the video features, the corresponding temporal class activation sequences CAS_act, CAS_amb, and CAS_bkd are constructed from the three attention weights and the original temporal class activation sequence F. For the salient actions this can be formulated as:
CAS_act = a_act * F
and, similarly, the sequences describing the fuzzy actions and the salient background are obtained as CAS_amb = a_amb * F and CAS_bkd = a_bkd * F.
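The classification branch and the three-branch attention model can be sketched as follows; the 1×1 kernel size, the softmax taken over the three branches at each time step, and the class count are illustrative assumptions rather than values given in the patent.

```python
import torch
import torch.nn as nn

class LocalizationHeads(nn.Module):
    """Classification branch (T-CAS) and three-branch attention, following the description above."""
    def __init__(self, emb_dim=2048, num_classes=20):
        super().__init__()
        # F = Conv(X_in, theta_cls): (c+1) outputs, the last one is the background class
        self.cls_conv = nn.Conv1d(emb_dim, num_classes + 1, kernel_size=1)
        # Att = Softmax(Conv(X_in, theta_att)): one weight per branch (act / amb / bkd)
        self.att_conv = nn.Conv1d(emb_dim, 3, kernel_size=1)

    def forward(self, x_in):                                   # x_in: (B, s, T)
        cas = self.cls_conv(x_in)                              # original T-CAS F: (B, c+1, T)
        att = torch.softmax(self.att_conv(x_in), dim=1)        # (B, 3, T), normalized over branches
        a_act, a_amb, a_bkd = att[:, 0:1], att[:, 1:2], att[:, 2:3]
        # branch-specific class activation sequences: CAS_x = a_x * F
        cas_act, cas_amb, cas_bkd = a_act * cas, a_amb * cas, a_bkd * cas
        return cas, cas_act, cas_amb, cas_bkd, (a_act, a_amb, a_bkd)
```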
To evaluate the loss of each temporal class activation sequence, the invention aggregates the class activation values of the video segments by top-k pooling to obtain video-level action class scores. Taking F as an example, this can be formulated as:
s_j = (1/k) Σ_{i∈ℓ} F(j, i)
where ℓ ⊂ {1, 2, ..., T} is the set of the k time indices with the largest activation values for class j, |ℓ| = k = max(1, T//r), and r is a preset parameter. Finally, a Softmax function is applied along the category dimension to obtain the video-level action class scores, and the classification loss is computed with a cross-entropy function:
p_j = exp(s_j) / Σ_{j'=1}^{c+1} exp(s_{j'})
L_cls = - Σ_{j=1}^{c+1} y_j log(p_j)
where j = 1, 2, ..., c+1, p_j is the probability that the video contains action j, y_j is the video-level label, and L_cls is the classification loss function of the original temporal class activation sequence. Similarly, the corresponding classification loss functions L_cls^act, L_cls^amb, and L_cls^bkd can be obtained from the temporal class activation sequences CAS_act, CAS_amb, and CAS_bkd.
The loss function of the salient-action temporal class activation sequence CAS_act is:
L_cls^act = - Σ_{j=1}^{c+1} y_j^act log(p_j^act)
where p_j^act is the video-level class score aggregated from CAS_act by top-k pooling, k_act = max(1, T//r_act), and r_act is a preset parameter.
The loss function of the fuzzy-action temporal class activation sequence CAS_amb is:
L_cls^amb = - Σ_{j=1}^{c+1} y_j^amb log(p_j^amb)
where p_j^amb is the video-level class score aggregated from CAS_amb by top-k pooling, k′_amb = max(1, T//r′_amb), and r′_amb is a preset parameter.
The loss function of the salient-background temporal class activation sequence CAS_bkd is:
L_cls^bkd = - Σ_{j=1}^{c+1} y_j^bkd log(p_j^bkd)
where p_j^bkd is the video-level class score aggregated from CAS_bkd by top-k pooling, k_bkd = max(1, T//r_bkd), and r_bkd is a preset parameter.
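The top-k aggregation and cross-entropy losses above can be sketched as a single helper applied to each of the four class activation sequences; the label normalization and the preset parameter value are assumptions, and the per-branch label vectors follow the assignment given in the detailed description below.

```python
import torch
import torch.nn.functional as F_torch

def video_level_loss(cas, y, r=8):
    """cas: (B, c+1, T) class activation sequence; y: (B, c+1) video-level label (incl. background dim)."""
    B, C, T = cas.shape
    k = max(1, T // r)                              # k = max(1, T // r)
    s = cas.topk(k, dim=2).values.mean(dim=2)       # s_j: mean of the top-k activations per class
    p = F_torch.softmax(s, dim=1)                   # video-level class probabilities p_j
    y_norm = y / y.sum(dim=1, keepdim=True)         # normalize the multi-hot labels (assumption)
    return -(y_norm * torch.log(p + 1e-8)).sum(dim=1).mean()

# the four classification losses reuse this routine with different labels / preset r values:
#   L_cls      : y = [ground-truth actions, background = 1]
#   L_cls_act  : y = [ground-truth actions, background = 0]
#   L_cls_amb  : y = [ground-truth actions, background = 1]
#   L_cls_bkd  : y = [zeros,                background = 1]
```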
The above process alone makes it difficult to directly localize fuzzy action segments in complex un-trimmed videos. Therefore, the invention designs a fuzzy-action contrast loss function to refine the video features. First, according to the salient-action attention a_act, top-k pooling is applied to the embedded features X_in to capture the salient-action features X_act:
X_act = { X_in(l) | l ∈ topk(k_act, a_act) }
where k_act = max(1, T//r_act) is a hyperparameter, r_act is a preset parameter controlling the sampling rate of the salient-action features, and topk(k, x) returns the time indices of the k largest values of x. In the same way, the salient-background features X_bkd can be obtained:
X_bkd = { X_in(l) | l ∈ topk(k_bkd, a_bkd) }
whose parameters are similar to those of X_act. Because the attention weight a_amb attends to both the salient actions and the fuzzy actions, the fuzzy-action features are difficult to acquire directly, while the salient-action weights are slightly larger than the fuzzy-action weights. Therefore, the time indices corresponding to the salient-action features and the salient-background features are first removed from a_amb, which is formulated as:
a′_amb(l) = a_amb(l) if l ∉ topk(k_act, a_act) ∪ topk(k_bkd, a_bkd), otherwise a′_amb(l) = 0.
Then the same top-k pooling is used to obtain the fuzzy-action features X_amb:
X_amb = { X_in(l) | l ∈ topk(k_amb, a′_amb) }
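A sketch of the feature selection described above; treating the removal of indices from a_amb as zeroing them out, and the per-video shapes and preset sampling rates, are assumptions.

```python
import torch

def sample_contrast_features(x_in, a_act, a_amb, a_bkd, r_act=8, r_bkd=8, r_amb=8):
    """x_in: (s, T) embedded features; a_*: (T,) attention weights for one video."""
    T = x_in.shape[1]
    k_act = max(1, T // r_act)
    k_bkd = max(1, T // r_bkd)
    k_amb = max(1, T // r_amb)

    idx_act = a_act.topk(k_act).indices            # topk(k_act, a_act): salient-action time indices
    idx_bkd = a_bkd.topk(k_bkd).indices            # salient-background time indices
    x_act = x_in[:, idx_act]                       # X_act: (s, k_act)
    x_bkd = x_in[:, idx_bkd]                       # X_bkd: (s, k_bkd)

    # a'_amb: suppress the time indices already taken by salient action / background
    a_amb_masked = a_amb.clone()
    a_amb_masked[idx_act] = 0.0
    a_amb_masked[idx_bkd] = 0.0
    idx_amb = a_amb_masked.topk(k_amb).indices     # remaining high-attention positions -> fuzzy action
    x_amb = x_in[:, idx_amb]                       # X_amb: (s, k_amb)
    return x_act, x_amb, x_bkd
```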
The parameters of X_amb are similar to those of X_act. Finally, an InfoNCE loss function is applied at the video-segment level to compute the fuzzy-action contrast loss and refine the fuzzy-action features. Given fuzzy-action features x_amb ~ X_amb, salient-action features x_act ~ X_act, and salient-background features x_bkd ~ X_bkd, the InfoNCE loss is introduced:
L_con = - E_{x_amb, x_act} [ log( exp(sim(x_amb, x_act)/τ) / ( exp(sim(x_amb, x_act)/τ) + Σ_{x_bkd ∈ X_bkd} exp(sim(x_amb, x_bkd)/τ) ) ) ]
where sim(·,·) denotes feature similarity, τ = 0.07 is the temperature constant, topk(k, x) returns the time indices of the k largest values of x, k_amb = max(1, T//r_amb) with r_amb a preset parameter controlling the sampling rate of the fuzzy-action features, and k_bkd is a hyperparameter controlling the sampling rate of the salient-background features X_bkd. In addition to the above loss functions, an L1 loss is introduced to ensure the sparsity of the salient-action attention weights a_act:
L_att = (1/T) ||a_act||_1
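One way to realize the fuzzy-action contrast loss with a standard InfoNCE formulation is sketched below; the cosine similarity, the averaging over fuzzy-action features, and the multi-positive treatment are assumptions, since the patent only names InfoNCE with the stated positive and negative pairs.

```python
import torch
import torch.nn.functional as F_torch

def fuzzy_action_contrast_loss(x_act, x_amb, x_bkd, tau=0.07):
    """x_act: (s, k_act), x_amb: (s, k_amb), x_bkd: (s, k_bkd) column features for one video."""
    # cosine similarities between every fuzzy-action feature and the salient features
    x_act, x_amb, x_bkd = (F_torch.normalize(x, dim=0) for x in (x_act, x_amb, x_bkd))
    sim_pos = (x_amb.t() @ x_act) / tau            # (k_amb, k_act): positive pairs (fuzzy vs. action)
    sim_neg = (x_amb.t() @ x_bkd) / tau            # (k_amb, k_bkd): negative pairs (fuzzy vs. background)
    # InfoNCE: pull fuzzy action toward salient action, push it away from salient background
    logits = torch.cat([sim_pos, sim_neg], dim=1)  # (k_amb, k_act + k_bkd)
    log_prob = F_torch.log_softmax(logits, dim=1)
    return -log_prob[:, : sim_pos.shape[1]].mean()

def sparsity_loss(a_act):
    """L1 sparsity on the salient-action attention: L_att = ||a_act||_1 / T."""
    return a_act.abs().mean()
```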
Finally, all the loss functions are combined to compute the total loss function L_total, and the network is trained to convergence by optimization:
L_total = L_cls + L_cls^act + L_cls^amb + L_cls^bkd + α L_con + β L_att
where α and β are the corresponding loss coefficients.
In the testing phase, CAS_act models the action contours more accurately, so the video-level class scores p_act are obtained from CAS_act, a threshold θ_cls is set, and the action categories c_act whose scores in p_act exceed θ_cls are selected. A multi-threshold segmentation strategy is then applied to the dimension of CAS_act corresponding to category c_act to obtain a large number of action nominations. For an action nomination (t_s, t_e, c_act), the confidence score is computed by contrasting the class activation values of CAS_act inside the nomination with those in the regions adjacent to its boundaries, where t_s and t_e are the start and end times of the action, l_i = t_e - t_s, and μ is a preset parameter. Finally, a non-maximum suppression algorithm is used to remove redundant nominations and obtain the final action localization result.
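The test-time procedure can be sketched as follows; the threshold values, the outer-region size derived from μ, and the inner-minus-outer confidence score are illustrative assumptions consistent with, but not guaranteed to match, the patent's exact formulas.

```python
import numpy as np

def temporal_iou(p, q):
    """Temporal IoU of two proposals (t_s, t_e, ...)."""
    inter = max(0.0, min(p[1], q[1]) - max(p[0], q[0]))
    union = max(p[1], q[1]) - min(p[0], q[0])
    return inter / union if union > 0 else 0.0

def localize_actions(cas_act, p_act, theta_cls=0.2,
                     seg_thresholds=(0.1, 0.2, 0.3, 0.4, 0.5), mu=4, iou_thr=0.5):
    """cas_act: (c+1, T) salient-action CAS; p_act: (c+1,) video-level class scores."""
    proposals = []
    for c in np.where(p_act[:-1] > theta_cls)[0]:        # action classes scored above theta_cls
        track = cas_act[c]
        for th in seg_thresholds:                        # multi-threshold segmentation
            mask = track > th
            t = 0
            while t < len(mask):
                if mask[t]:
                    s = t
                    while t < len(mask) and mask[t]:
                        t += 1
                    e = t                                # candidate nomination [s, e)
                    margin = max(1, (e - s) // mu)       # outer extent controlled by mu (assumed)
                    lo, hi = max(0, s - margin), min(len(track), e + margin)
                    outer = np.concatenate([track[lo:s], track[e:hi]])
                    outer_mean = outer.mean() if outer.size else 0.0
                    proposals.append((s, e, int(c), track[s:e].mean() - outer_mean))
                t += 1
    # non-maximum suppression: keep the best-scoring nomination among overlapping same-class ones
    proposals.sort(key=lambda p: p[3], reverse=True)
    kept = []
    for p in proposals:
        if all(p[2] != q[2] or temporal_iou(p, q) < iou_thr for q in kept):
            kept.append(p)
    return kept
```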
The invention has the following advantages and beneficial effects:
1. The invention provides a weakly supervised temporal action localization method based on contrastive learning. During training, only video-level action category labels are used as supervision, and no manually annotated temporal action labels are required, which greatly reduces the consumption of manpower and material resources.
2. The invention models the salient actions, fuzzy actions, and salient background in the video respectively through the multi-branch attention model, effectively separates the action features from the background features in the video, and significantly improves the action localization accuracy on different datasets.
3. The invention designs the fuzzy-action contrast loss function, which refines the video features under the guidance of the salient features, enables the network to perceive more accurate temporal boundaries, prevents the action localization results from being truncated, and effectively improves the action localization accuracy.
4. The invention obtains better results than current mainstream action localization models without introducing a recurrent neural network into the action localization network, which avoids the vanishing-gradient problem that recurrent neural networks are prone to, reduces the computational cost of the network, and speeds up network training.
Drawings
Fig. 1 is the network structure of the weakly supervised temporal action localization method based on contrastive learning according to the present invention.
Fig. 2 shows visualization results according to an embodiment of the invention.
Detailed Description
The present invention will be described in further detail below with reference to specific embodiments, which are illustrative only and not limiting; the protection scope of the present invention is not limited thereby.
The invention relates to a weakly supervised temporal action localization method based on contrastive learning. It adopts staged training of the feature extraction network and the action localization network, models the salient actions, fuzzy actions, and salient background in the video with a multi-branch attention model to effectively separate the action features from the background features, and introduces a fuzzy-action contrast loss function that refines the video features under the guidance of the salient features, so that the network perceives more accurate temporal boundaries, truncated localization results are avoided, and the action localization accuracy is effectively improved.
Fig. 1 shows the network structure of the weakly supervised temporal action localization method based on contrastive learning according to the present invention.
The overall framework of the invention mainly comprises two networks: a feature extraction network and an action localization network.
The feature extraction network adopts an I3D network pre-trained on the Kinetics dataset as its backbone. The network is built on a 3D Inception model, and four pooling layers with a temporal stride of 2 are inserted to control the number of network parameters, so that temporal features can be fused, the size of the receptive field can be controlled reasonably, and the loss of detail information is prevented. The action localization network consists of a feature embedding model, a classification model, and a multi-branch attention model, all of which are built with temporal convolutional networks to better capture the temporal features of the video.
The datasets adopted by the invention are the THUMOS-14 dataset and the ActivityNet-1.2 dataset. The THUMOS-14 dataset contains 20 action classes, each video contains 15.4 action segments on average, and all data are obtained from the YouTube website. Video lengths vary from tens of seconds to tens of minutes, making it a challenging dataset for the weakly supervised temporal action localization task. Following the data partition used by previous mainstream algorithms, the invention adopts the 200 validation videos with temporal annotations as the training set and the 213 test videos as the test set. ActivityNet-1.2 is a large-scale temporal action localization dataset containing 100 action classes; the training set includes 4819 videos and the test set includes 2382 videos. On average each video contains 1.5 action segments and 36% background, and the proportion of action segments is significantly lower than in the THUMOS-14 dataset.
First, every 16 consecutive frames of the un-trimmed video are grouped into one video segment, giving T video segments, which are fed into the pre-trained feature extraction network to extract RGB features and optical flow features; these are concatenated to obtain the video features X. The feature extraction network does not participate in the subsequent weakly supervised training. The video features X are then fed into a feature embedding model built from temporal convolutions and mapped to the feature space of the weakly supervised temporal action localization task to learn more discriminative embedded features X_in ∈ R^{s×T}, where s is the feature dimension and T is the time dimension.
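The snippet-level feature extraction can be sketched as below; the i3d_rgb and i3d_flow callables stand in for the frozen pre-trained I3D streams, and the 1024-dimensional outputs and input layout are assumptions.

```python
import torch

@torch.no_grad()  # the feature extractor is frozen and does not join the weakly supervised training
def extract_video_features(frames, flows, i3d_rgb, i3d_flow, snippet_len=16):
    """frames/flows: (num_frames, C, H, W) tensors; returns X with shape (feature_dim, T)."""
    T = frames.shape[0] // snippet_len               # number of 16-frame video segments
    feats = []
    for t in range(T):
        clip = slice(t * snippet_len, (t + 1) * snippet_len)
        rgb_feat = i3d_rgb(frames[clip].unsqueeze(0))    # assumed to return (1, 1024)
        flow_feat = i3d_flow(flows[clip].unsqueeze(0))   # assumed to return (1, 1024)
        feats.append(torch.cat([rgb_feat, flow_feat], dim=1))
    return torch.cat(feats, dim=0).t()               # X: (2048, T), ready for the embedding model
```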
Next, the embedded features X_in are used to obtain the temporal class activation sequences for localizing actions. To this end, the invention designs two branches in the action localization network: a classification branch and an attention branch.
In the classification branch, a classification model is built with temporal convolution, and the embedded features X_in are mapped to the action-category feature space to obtain the original temporal class activation sequence F ∈ R^{(c+1)×T}, which represents the probability distribution of actions over time; c is the number of action categories, and the (c+1)-th dimension corresponds to the background class.
However, with only the original temporal class activation sequence F, it is difficult for the network to separate the salient actions from the salient background in the video. To enable the network to separate salient background segments from salient action segments and to detect fuzzy action segments, an attention model with three branches is designed based on temporal convolution to model the salient actions, the salient background, and the fuzzy actions respectively; the attention weights of the three branches are obtained, and a Softmax function is adopted to normalize the output of the attention model. The output of the model is the attention weights Att = [a_act; a_amb; a_bkd] ∈ R^{3×T}, where a_act, a_amb, and a_bkd correspond to the probability distributions over time of the salient actions, the fuzzy actions, and the salient background, respectively. The specific process is:
Att = Softmax(Conv(X_in, θ_att))
where θ_att are the trainable attention model parameters. To distinguish the salient actions, fuzzy actions, and salient background in the video features, the corresponding temporal class activation sequences CAS_act, CAS_amb, and CAS_bkd are constructed from the three attention weights and the original temporal class activation sequence F. For the salient actions this can be formulated as:
CAS_act = a_act * F
and, similarly, the sequences describing the fuzzy actions and the salient background are obtained as CAS_amb = a_amb * F and CAS_bkd = a_bkd * F.
CAS_act has higher activation values at the time positions of the salient actions in the video and is suppressed at the time positions of the salient background, while CAS_bkd has higher activation values at the time positions of the salient background. Thus, based on CAS_act and CAS_bkd, the network can separate the salient actions from the salient background in the video. CAS_amb has higher activation values at the time positions of both the salient actions and the fuzzy actions.
The invention aggregates the salient features in the video through a multiple-instance learning mechanism to supervise the training process of the network. The whole un-trimmed video is regarded as a multiple-instance bag, each video segment is treated as an instance, and each video segment obtains its corresponding class activation value by the method above. To evaluate the loss of each temporal class activation sequence, the invention aggregates the class activation values of the video segments by top-k pooling to obtain video-level action class scores. Taking F as an example, this can be formulated as:
s_j = (1/k) Σ_{i∈ℓ} F(j, i)
where ℓ ⊂ {1, 2, ..., T} is the set of the k time indices with the largest activation values for class j, |ℓ| = k = max(1, T//r), and r is a preset parameter. Finally, a Softmax function is applied along the category dimension to obtain the video-level action class scores, and the classification loss is computed with a cross-entropy function:
p_j = exp(s_j) / Σ_{j'=1}^{c+1} exp(s_{j'})
L_cls = - Σ_{j=1}^{c+1} y_j log(p_j)
where j = 1, 2, ..., c+1, p_j is the probability that the video contains action j, and L_cls is the classification loss function of the original temporal class activation sequence. Similarly, the corresponding classification loss functions L_cls^act, L_cls^amb, and L_cls^bkd can be obtained from the temporal class activation sequences CAS_act, CAS_amb, and CAS_bkd.
the invention regards a whole un-clipped video as a multi-example packet containing actions and backgrounds at the same time, and the category label of the original time domain class activation sequence is set as yj=1,y c+11. Second, to guarantee CASactAnd CASbkdCorresponding attention aactAnd abkdRespectively paying attention to the salient motion and the salient background in the video, and respectively setting the category labels of the salient motion and the salient background as yj=1,y c+10 and yj=0,y c+11. In addition, to locate blurred motion in video, the present inventionThe invention sets the CASambClass label of yj=1,y c+11, let aambThe method can focus on the remarkable action with a high activation value and the fuzzy action with a relatively low activation value in the video at the same time.
Although the above process can separate actions from background using the multi-branch attention model, the network lacks the guidance of action temporal-scale information, it is difficult to directly localize fuzzy action segments in complex un-trimmed videos, and the completeness of the localization results cannot be guaranteed. A fuzzy action segment, however, tends to be temporally adjacent to a salient action segment and far from the salient background segments; moreover, its attention weight is slightly lower than the salient-action attention weight but significantly larger than the salient-background attention weight. Based on this idea, the invention provides a simple and effective method for localizing the fuzzy action segments in the video and designs a fuzzy-action contrast loss function to refine the video features, so that the network can localize more complete actions. First, according to the salient-action attention a_act, top-k pooling is applied to the embedded features X_in to capture the salient-action features X_act:
X_act = { X_in(l) | l ∈ topk(k_act, a_act) }
where k_act = max(1, T//r_act) is a hyperparameter, r_act is a preset parameter controlling the sampling rate of the salient-action features, and topk(k, x) returns the time indices of the k largest values of x. In the same way, the salient-background features X_bkd can be obtained:
X_bkd = { X_in(l) | l ∈ topk(k_bkd, a_bkd) }
whose parameters are similar to those of X_act. Because the attention weight a_amb attends to both the salient actions and the fuzzy actions, the fuzzy-action features are difficult to acquire directly, while the salient-action weights are slightly larger than the fuzzy-action weights. Therefore, the time indices corresponding to the salient-action features and the salient-background features are first removed from a_amb, which is formulated as:
a′_amb(l) = a_amb(l) if l ∉ topk(k_act, a_act) ∪ topk(k_bkd, a_bkd), otherwise a′_amb(l) = 0.
Then the same top-k pooling is used to obtain the fuzzy-action features X_amb:
X_amb = { X_in(l) | l ∈ topk(k_amb, a′_amb) }
The parameters of X_amb are similar to those of X_act. Finally, an InfoNCE loss function is applied at the video-segment level to compute the fuzzy-action contrast loss and refine the fuzzy-action features. Positive sample pairs are constructed from the salient-action features and the fuzzy-action features, and negative sample pairs are constructed from the salient-background features and the fuzzy-action features, which drives the salient actions and the fuzzy actions closer together in the feature space while pushing the salient background and the fuzzy actions apart. Given fuzzy-action features x_amb ~ X_amb, salient-action features x_act ~ X_act, and salient-background features x_bkd ~ X_bkd, the InfoNCE loss is introduced:
L_con = - E_{x_amb, x_act} [ log( exp(sim(x_amb, x_act)/τ) / ( exp(sim(x_amb, x_act)/τ) + Σ_{x_bkd ∈ X_bkd} exp(sim(x_amb, x_bkd)/τ) ) ) ]
where sim(·,·) denotes feature similarity, k_bkd is a hyperparameter controlling the sampling rate of the salient-background features X_bkd, and τ = 0.07 is the temperature constant. This loss maximizes the mutual information between the salient action segments and the fuzzy action segments. Therefore, during each round of iterative training the network keeps discovering new fuzzy-action features and contrasting them with the salient features, so that the feature information within the true action extent becomes richer, the discriminability of the feature distribution improves, and the full course of a complete action is captured. In addition to the above loss functions, an L1 loss is introduced to ensure the sparsity of the salient-action attention weights a_act:
L_att = (1/T) ||a_act||_1
Finally, all the loss functions are combined to compute the total loss function L_total, and the network is trained to convergence with an Adam optimizer:
L_total = L_cls + L_cls^act + L_cls^amb + L_cls^bkd + α L_con + β L_att
where α and β are the corresponding loss coefficients.
In the testing phase, CAS_act models the action contours more accurately, so the video-level class scores p_act are obtained from CAS_act, a threshold θ_cls is set, and the action categories c_act whose scores in p_act exceed θ_cls are selected. A multi-threshold segmentation strategy is then applied to the dimension of CAS_act corresponding to category c_act to obtain a large number of action nominations. For an action nomination (t_s, t_e, c_act), the confidence score is computed by contrasting the class activation values of CAS_act inside the nomination with those in the regions adjacent to its boundaries, where t_s and t_e are the start and end times of the action, l_i = t_e - t_s, and μ is a preset parameter. Finally, a non-maximum suppression algorithm is used to remove redundant nominations and obtain the final action localization result.
The experiments are implemented with the PyTorch deep learning framework; the specific parameters are shown in Table 1 below:
TABLE 1 (parameter settings)
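Schematically, one training update with the Adam optimizer could look like the following; the learning rate, the unit loss weights, and the way the individual loss terms are supplied are placeholders rather than values from Table 1.

```python
import torch

def optimize_step(optimizer, l_cls, l_cls_act, l_cls_amb, l_cls_bkd, l_con, l_att,
                  alpha=1.0, beta=1.0):
    """Combine the loss terms into L_total and apply one Adam update."""
    l_total = l_cls + l_cls_act + l_cls_amb + l_cls_bkd + alpha * l_con + beta * l_att
    optimizer.zero_grad()
    l_total.backward()
    optimizer.step()
    return l_total.item()

# usage (hypothetical): hand the action localization network's parameters to Adam once, before training
# optimizer = torch.optim.Adam(action_localization_net.parameters(), lr=1e-4)
```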
The model was trained to convergence and evaluated on the THUMOS-14 dataset and the ActivityNet-1.2 dataset. The evaluation results are shown in Tables 2 and 3, respectively, from which it can be seen that the action localization accuracy of the method exceeds the previous mainstream methods on both datasets.
TABLE 2 (localization results on the THUMOS-14 dataset)
TABLE 3 (localization results on the ActivityNet-1.2 dataset)
FIG. 2 compares the visualization results of the method of the present invention with the previous best method, HAM-Net. (a) The actions correspond to an athlete's weightlifting process. In the two stages of picking the barbell up from the ground (frame [1]) and lifting it overhead (frame [4]), the motion amplitude is large and the action features are distinctive, and there is an obvious scene cut in the background [5]; these can easily be localized by the baseline method. However, in the middle of the lift the athlete raises the barbell and pauses at the chest position (frame [2]), and there is an obvious shot change during this stage (frame [3]). Without temporal supervision this stage is difficult to capture, but the method of the invention localizes it completely. (b) The video contains several golf swings, and the third action segment is played entirely in slow motion; the baseline method detects only part of the swing. Players tend to pause and exert force when the club is at its highest and lowest points (frames [1][3][5]), and because of the slow motion the action features at these time positions are more ambiguous and difficult to distinguish from the static background. The localization results show that the method solves this problem: the third action is localized completely without affecting the localization of the other actions, which fully demonstrates the effectiveness of the method.
The above description covers only preferred embodiments of the present invention, but the protection scope of the present invention is not limited thereto; any substitution or change that a person skilled in the art makes to the technical solution of the present invention and its inventive concept falls within the scope of the present invention.

Claims (8)

1. A weakly supervised temporal action localization method based on contrastive learning, characterized by comprising the following steps:
1) constructing a feature extraction network and an action localization network, wherein the action localization network comprises two branches corresponding respectively to a classification model and a multi-branch attention model;
2) constructing a staged weak-supervision training method in which the network learns only under the supervision of video-level action category labels: the original video sequence is processed, the RGB data and optical flow data are fed respectively into the pre-trained feature extraction network to extract features, and the features are concatenated to obtain the video features X; the video features X are then fed into a feature embedding model and mapped to the feature space of the weakly supervised temporal action localization task to obtain the embedded features X_in;
3) inputting the embedded features X_in into the classification model to obtain the original temporal class activation sequence F;
4) inputting the embedded features X_in into the multi-branch attention model to obtain the salient-action attention weight a_act, the fuzzy-action attention weight a_amb, and the salient-background attention weight a_bkd, and constructing the three corresponding temporal class activation sequences, namely the salient-action temporal class activation sequence CAS_act, the fuzzy-action temporal class activation sequence CAS_amb, and the salient-background temporal class activation sequence CAS_bkd, wherein the output of the multi-branch attention model is the normalized attention weights;
5) constructing positive and negative sample pairs according to the normalized attention weights, computing the fuzzy-action contrast loss function L_con, combining all the loss functions to compute the total loss function L_total, and training the network to convergence by optimization;
6) in the testing phase, applying threshold segmentation to the temporal class activation sequence CAS_act to obtain a large number of action nominations, and finally removing redundant nominations with a non-maximum suppression algorithm to obtain the final action localization result.
2. The weakly supervised temporal action localization method based on contrastive learning of claim 1, wherein: the feature extraction network in step 1) adopts an I3D network pre-trained on the Kinetics dataset, the I3D network does not participate in the subsequent weakly supervised training, and the classification model and the multi-branch attention model are built with temporal convolutional networks.
3. The weakly supervised temporal action localization method based on contrastive learning of claim 1, wherein: in step 2) the pre-trained feature extraction network is an I3D network, and the embedded features X_in are computed as:
X_in = ReLU(Conv(X, θ_emb))
where X_in ∈ R^{s×T}, s is the feature dimension, T is the time dimension, θ_emb are the trainable parameters of the feature embedding model, and ReLU is the activation function.
4. The weakly supervised temporal action localization method based on contrastive learning of claim 1, wherein: in step 3),
F = Conv(X_in, θ_cls)
where θ_cls are the trainable classification model parameters.
5. The weakly supervised temporal action localization method based on contrastive learning of claim 1, wherein: the normalized attention weights in step 4) are:
Att = Softmax(Conv(X_in, θ_att))
where θ_att are the trainable attention model parameters and Att = [a_act; a_amb; a_bkd] ∈ R^{3×T}.
6. The weakly supervised temporal action localization method based on contrastive learning of claim 1, wherein: the loss function of the original temporal class activation sequence F is:
L_cls = - Σ_{j=1}^{c+1} y_j log(p_j)
where p_j is the probability that the video contains action j, obtained by applying a Softmax function to the video-level class scores aggregated from F by top-k pooling over the time index set ℓ ⊂ {1, 2, ..., T} with |ℓ| = k = max(1, T//r), r is a preset parameter, and j = 1, 2, ..., c+1;
the loss function of the salient-action temporal class activation sequence CAS_act is:
L_cls^act = - Σ_{j=1}^{c+1} y_j^act log(p_j^act)
where p_j^act is the video-level class score aggregated from CAS_act by top-k pooling, k_act = max(1, T//r_act), and r_act is a preset parameter;
the loss function of the fuzzy-action temporal class activation sequence CAS_amb is:
L_cls^amb = - Σ_{j=1}^{c+1} y_j^amb log(p_j^amb)
where p_j^amb is the video-level class score aggregated from CAS_amb by top-k pooling, k′_amb = max(1, T//r′_amb), and r′_amb is a preset parameter;
the loss function of the salient-background temporal class activation sequence CAS_bkd is:
L_cls^bkd = - Σ_{j=1}^{c+1} y_j^bkd log(p_j^bkd)
where p_j^bkd is the video-level class score aggregated from CAS_bkd by top-k pooling, k_bkd = max(1, T//r_bkd), and r_bkd is a preset parameter.
7. The weakly supervised temporal action localization method based on contrastive learning of claim 1, wherein: the fuzzy-action contrast loss in step 5) is:
L_con = - E_{x_amb, x_act} [ log( exp(sim(x_amb, x_act)/τ) / ( exp(sim(x_amb, x_act)/τ) + Σ_{x_bkd ∈ X_bkd} exp(sim(x_amb, x_bkd)/τ) ) ) ]
where τ is the temperature constant, sim(·,·) denotes feature similarity, x_act ~ X_act, x_bkd ~ X_bkd, x_amb ~ X_amb, topk(k, x) returns the time indices of the k largest values of x, k_amb = max(1, T//r_amb), and r_amb is a preset parameter controlling the sampling rate of the fuzzy-action features.
8. The weakly supervised temporal action localization method based on contrastive learning of claim 1, wherein: the specific method of step 6) is: in the testing phase, the video-level class scores p_act are obtained from CAS_act, a threshold θ_cls is set, and the action categories c_act whose scores in p_act exceed θ_cls are selected; then a multi-threshold segmentation strategy is applied to the dimension of CAS_act corresponding to category c_act to obtain a large number of action nominations; for each action nomination (t_s, t_e, c_act), the confidence score is computed by contrasting the class activation values of CAS_act inside the nomination with those in the regions adjacent to its boundaries, where t_s and t_e are the start and end times of the action, l_i = t_e - t_s, and μ is a preset parameter; finally, a non-maximum suppression algorithm is adopted to remove redundant nominations to obtain the final action localization result.
CN202111610682.4A 2021-12-27 2021-12-27 Comparison learning-based weak supervision time sequence action positioning method Pending CN114494941A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111610682.4A CN114494941A (en) 2021-12-27 2021-12-27 Comparison learning-based weak supervision time sequence action positioning method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111610682.4A CN114494941A (en) 2021-12-27 2021-12-27 Comparison learning-based weak supervision time sequence action positioning method

Publications (1)

Publication Number Publication Date
CN114494941A true CN114494941A (en) 2022-05-13

Family

ID=81495834

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111610682.4A Pending CN114494941A (en) 2021-12-27 2021-12-27 Comparison learning-based weak supervision time sequence action positioning method

Country Status (1)

Country Link
CN (1) CN114494941A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116030538A (en) * 2023-03-30 2023-04-28 中国科学技术大学 Weak supervision action detection method, system, equipment and storage medium
CN116503959A (en) * 2023-06-30 2023-07-28 山东省人工智能研究院 Weak supervision time sequence action positioning method and system based on uncertainty perception
CN116503959B (en) * 2023-06-30 2023-09-08 山东省人工智能研究院 Weak supervision time sequence action positioning method and system based on uncertainty perception

Similar Documents

Publication Publication Date Title
CN108830252A (en) A kind of convolutional neural networks human motion recognition method of amalgamation of global space-time characteristic
CN109949317A (en) Based on the semi-supervised image instance dividing method for gradually fighting study
CN108600865B (en) A kind of video abstraction generating method based on super-pixel segmentation
CN114494941A (en) Comparison learning-based weak supervision time sequence action positioning method
CN106682108A (en) Video retrieval method based on multi-modal convolutional neural network
TWI712316B (en) Method and device for generating video summary
CN110348364B (en) Basketball video group behavior identification method combining unsupervised clustering and time-space domain depth network
CN106709453A (en) Sports video key posture extraction method based on deep learning
CN108491766B (en) End-to-end crowd counting method based on depth decision forest
CN106529477A (en) Video human behavior recognition method based on significant trajectory and time-space evolution information
CN109886165A (en) A kind of action video extraction and classification method based on moving object detection
CN112906631B (en) Dangerous driving behavior detection method and detection system based on video
CN108509939A (en) A kind of birds recognition methods based on deep learning
CN110210383A (en) A kind of basketball video Context event recognition methods of fusional movement mode and key visual information
CN111462162A (en) Foreground segmentation algorithm for specific class of pictures
CN114049581A (en) Weak supervision behavior positioning method and device based on action fragment sequencing
CN115410119A (en) Violent movement detection method and system based on adaptive generation of training samples
CN110968721A (en) Method and system for searching infringement of mass images and computer readable storage medium thereof
CN111191531A (en) Rapid pedestrian detection method and system
Zhao et al. Action recognition based on C3D network and adaptive keyframe extraction
CN105893967B (en) Human behavior classification detection method and system based on time sequence retention space-time characteristics
US20230290118A1 (en) Automatic classification method and system of teaching videos based on different presentation forms
CN108491751A (en) A kind of compound action recognition methods of the exploration privilege information based on simple action
Alwassel et al. Action search: Learning to search for human activities in untrimmed videos
CN116935303A (en) Weak supervision self-training video anomaly detection method

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination