CN114494941A - Weakly supervised temporal action localization method based on contrastive learning - Google Patents
- Publication number: CN114494941A
- Application number: CN202111610682.4A
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS; G06—COMPUTING, CALCULATING OR COUNTING; G06F—ELECTRIC DIGITAL DATA PROCESSING; G06F18/00—Pattern recognition; G06F18/20—Analysing
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06F18/2155—Generating training patterns; Bootstrap methods characterised by the incorporation of unlabelled data, e.g. multiple instance learning [MIL], semi-supervised techniques using expectation-maximisation [EM] or naïve labelling
- G06F18/24—Classification techniques
Abstract
The invention discloses a weakly supervised temporal action localization method based on contrastive learning, which localizes actions of interest in untrimmed videos under the supervision of video-level action category labels only. First, a pre-trained feature extraction network extracts video features from the RGB data and optical-flow data of the original video, and these features are fed into a subsequent action localization network. The action localization network comprises two branches: one maps the video features to an original temporal class activation sequence (T-CAS); the other is a multi-branch attention model that separately models the salient action segments, background segments, and ambiguous action segments in the video, simultaneously generates three corresponding temporal class activation sequences, and enables the network, through a multiple-instance learning (MIL) mechanism, to separate action features from background features. The invention perceives accurate temporal action boundaries in untrimmed videos, avoids truncating complete actions, and greatly improves action localization accuracy.
Description
Technical Field
The invention belongs to the fields of computer vision and deep learning, relates to video localization technology, and in particular relates to a weakly supervised temporal action localization method based on contrastive learning.
Background
In recent years, with the development of deep learning, the field of video understanding has made very significant breakthroughs. Temporal action localization, a research hotspot within video understanding, has great application potential in many real-world scenarios, such as video surveillance, anomaly detection, and video retrieval. Its main task is to pinpoint the start and end times of actions of interest in long untrimmed videos and to classify those actions correctly. At present, temporal action localization is mostly trained in a fully supervised manner, the key requirement being a sufficient quantity of untrimmed videos annotated frame by frame. In the real world, however, annotating massive video data frame by frame demands enormous manpower and material resources; in addition, because actions are abstract, manually annotated temporal labels are easily influenced by subjective human factors, introducing labeling errors. Temporal action localization based on weakly supervised learning was therefore derived, in which only video-level action category labels serve as supervision during network training. Compared with precise temporal action labels, action category labels are much easier to obtain and effectively avoid the bias introduced by manual annotation.
Existing weakly supervised temporal action localization methods fall into two categories. The first, inspired by semantic segmentation techniques, maps weakly supervised temporal action localization to an action classification problem, introduces an action-background separation mechanism to construct video-level features, and finally recognizes the video with an action classifier. The second formulates temporal action localization as a multiple-instance learning task: the whole untrimmed video is regarded as a multi-instance bag containing both positive and negative samples, which correspond respectively to action segments and background segments in the video; a temporal class activation sequence is obtained through a classifier to describe the probability distribution of actions over time, top-k pooling aggregates the video-level class scores, and finally a threshold is applied to the temporal class activation sequence to localize the actions.
Both approaches solve the localization problem in untrimmed videos by learning an effective classification loss. Although they achieve a certain effect, like most weakly supervised methods they lack temporal labels, so the network struggles to model the full course of an action: it over-attends to the most salient parts of the action and ignores secondary regions whose features are less obvious. Furthermore, because the video has not been manually trimmed, a complete action often contains ambiguous frames such as shot transitions and slow-motion replays. These frames are semantically related to the action and are part of it, but their action features are weak, producing low activation values at those temporal positions; they are then hard to distinguish from salient background segments with equally low activation values and are falsely detected as background frames. Discovering and refining these ambiguous action features in the video, so that the network captures more complete action segments, is therefore of great significance for improving weakly supervised temporal action localization performance.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a weakly supervised temporal action localization method based on contrastive learning. The feature extraction network and the action localization network are trained in stages; the salient actions, ambiguous actions, and salient backgrounds in the video are modeled separately by a multi-branch attention model; and an ambiguous-action contrastive loss is introduced to refine the video features, so that the network perceives more accurate temporal boundaries and the action localization precision is effectively improved.
To solve the above technical problem, the invention adopts the following technical scheme.
First, a pre-trained I3D network extracts RGB features and optical-flow features from the original video, which are concatenated into the video feature X. X is fed into a feature-embedding model built from temporal convolutions and mapped into the feature space of the weakly supervised temporal action localization task to learn a more discriminative embedded feature X_in ∈ R^(T×s), expressed as:

X_in = ReLU(Conv(X, θ_emb))

where s is the feature dimension, θ_emb are the trainable parameters of the feature-embedding model, and ReLU is the activation function. Two branches are then designed in the action localization network: a classification branch and an attention branch.
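The embedding step can be sketched numerically. The following is a minimal NumPy illustration, not the patent's actual PyTorch implementation; the kernel size of 3, the 2048-dimensional concatenated input (assuming 1024-d RGB + 1024-d flow I3D features), and the 512-dimensional embedding are all illustrative choices:

```python
import numpy as np

def temporal_conv(x, w, b):
    """1-D temporal convolution over a (T, D_in) snippet-feature sequence.
    x: (T, D_in); w: (k, D_in, D_out); b: (D_out,). 'same' zero padding."""
    T, D_in = x.shape
    k, _, D_out = w.shape
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros((T, D_out))
    for t in range(T):
        window = xp[t:t + k]                        # (k, D_in)
        out[t] = np.einsum('kd,kde->e', window, w) + b
    return out

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2048))                     # T=20 snippets, concatenated features
W = rng.normal(size=(3, 2048, 512)) * 0.01          # illustrative kernel size 3
b = np.zeros(512)
X_in = np.maximum(temporal_conv(X, W, b), 0.0)      # ReLU(Conv(X, theta_emb))
print(X_in.shape)  # (20, 512)
```

In the patent's setting this would be a trainable `Conv1d` layer; here the weights are random stand-ins just to show the shape transformation from X to X_in.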
In the classification branch, a classification model is built from temporal convolutions and maps the embedded video feature X_in into the action-category feature space, yielding the original temporal class activation sequence F ∈ R^(T×(c+1)), which represents the probability distribution of actions over time; c is the number of action categories, and the (c+1)-th dimension corresponds to the background category. This process can be expressed as:

F = Conv(X_in, θ_cls)
where θ_cls are the trainable parameters of the classification model. To let the network separate salient background segments from salient action segments and detect ambiguous action segments in the video, the invention designs a three-branch attention model based on temporal convolutions that models salient action, salient background, and ambiguous action respectively. The output of the model is the attention weight Att ∈ R^(3×T), whose rows a_act, a_amb, and a_bkd correspond to the probability distributions over time of salient action, ambiguous action, and salient background, respectively. The specific process is:

Att = Softmax(Conv(X_in, θ_att))
where θ_att are the trainable parameters of the attention model and the Softmax normalizes across the three branches at each time step. To distinguish the salient action, ambiguous action, and salient background in the video features, the three attention weights are combined with the original temporal class activation sequence F to construct the corresponding sequences CAS_act, CAS_amb, and CAS_bkd. CAS_act can be formulated as:

CAS_act = a_act * F

where the attention weight is broadcast over the class dimension; the sequences CAS_amb = a_amb * F and CAS_bkd = a_bkd * F, describing the ambiguous action and the salient background, are obtained in the same way.
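The three-branch weighting can be illustrated with a toy NumPy sketch; the attention logits and T-CAS values are random stand-ins, and T = 20 snippets with c = 20 action classes are illustrative:

```python
import numpy as np

def softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
T, c = 20, 20

# Hypothetical attention logits from a 3-branch temporal-conv head.
att_logits = rng.normal(size=(3, T))
att = softmax(att_logits, axis=0)        # branches compete per snippet
a_act, a_amb, a_bkd = att

F = rng.normal(size=(T, c + 1))          # original T-CAS; last column = background

# Attention-weighted class activation sequences (broadcast over classes).
CAS_act = a_act[:, None] * F
CAS_amb = a_amb[:, None] * F
CAS_bkd = a_bkd[:, None] * F
print(CAS_act.shape)  # (20, 21)
```

Because the Softmax makes the three attention rows sum to 1 at every time step, the three weighted sequences decompose F exactly: CAS_act + CAS_amb + CAS_bkd = F.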
To evaluate the loss of each temporal class activation sequence, the invention aggregates the class activation values of the video snippets by top-k pooling to obtain video-level action class scores. Taking F as an example:

v_j = (1/k) · Σ_{l ∈ I_j} F(l, j)

where I_j ⊂ {1, 2, ..., T} is the index set of the k largest activations of class j, |I_j| = k = max(1, T//r), and r is a preset parameter. A Softmax function is then applied over the class dimension to obtain the video-level action class scores, and the classification loss is computed with a cross-entropy function:

L_cls^F = −Σ_{j=1}^{c+1} y_j log(p_j)

where p_j is the probability that the video contains action j, y_j is the corresponding video-level label, and L_cls^F is the classification loss of the original temporal class activation sequence. Similarly, classification losses L_cls^act, L_cls^amb, and L_cls^bkd are obtained for CAS_act, CAS_amb, and CAS_bkd, differing only in their label settings:
- for the salient-action sequence CAS_act, the labels are y_j = 1 and y_{c+1} = 0;
- for the ambiguous-action sequence CAS_amb, the labels are y_j = 1 and y_{c+1} = 1;
- for the salient-background sequence CAS_bkd, the labels are y_j = 0 and y_{c+1} = 1.
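A minimal NumPy sketch of the top-k pooling and cross-entropy step; the function names, the choice r = 8, and the assumption that class 0 is the video's true action are illustrative, not from the patent:

```python
import numpy as np

def topk_pool_scores(cas, r=8):
    """Video-level class logits: mean of the top-k activations per class."""
    T = cas.shape[0]
    k = max(1, T // r)
    return np.sort(cas, axis=0)[T - k:].mean(axis=0)   # (c+1,)

def video_ce_loss(cas, y, r=8):
    """Cross-entropy between softmaxed video-level scores and label vector y."""
    v = topk_pool_scores(cas, r)
    p = np.exp(v - v.max())
    p /= p.sum()
    return float(-(y * np.log(p + 1e-12)).sum())

rng = np.random.default_rng(0)
T, c = 20, 20
F = rng.normal(size=(T, c + 1))

# Labels for the original T-CAS: action class 0 and the background bit both on.
y_orig = np.zeros(c + 1)
y_orig[0] = 1
y_orig[c] = 1

loss = video_ce_loss(F, y_orig)
print(round(loss, 3))
```

The same `video_ce_loss` would be reused for CAS_act, CAS_amb, and CAS_bkd by swapping in their respective label vectors.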
The above process alone makes it difficult to directly localize ambiguous action segments in complex untrimmed videos. The invention therefore designs an ambiguous-action contrastive loss to refine the video features. First, according to the salient-action attention a_act, top-k pooling over the embedded feature X_in captures the salient action feature x_act:

x_act = (1/k_act) · Σ_{l ∈ topk(k_act, a_act)} X_in(l)

where k_act = max(1, T//r_act) is a hyperparameter, r_act is a preset parameter controlling the sampling rate of the salient action features, and topk(k, a) returns the time indices of the k largest values of a. The salient background feature x_bkd is obtained by the same method, with k_bkd = max(1, T//r_bkd) controlling the sampling rate of the salient background features. Because the attention weight a_amb attends to both salient and ambiguous actions, the ambiguous action features are difficult to acquire directly; however, the salient-action weights are slightly larger than the ambiguous-action weights. Therefore, the time indices corresponding to the salient action and salient background features are first removed from a_amb, and the ambiguous action feature x_amb is pooled from the top k_amb = max(1, T//r_amb) of the remaining weights, where r_amb is a preset parameter controlling the sampling rate of the ambiguous features. Finally, an InfoNCE loss is applied at the video-snippet level to compute the ambiguous-action contrastive loss and refine the ambiguous action features. Given the selected ambiguous action feature x_amb, salient action feature x_act, and salient background features x_bkd, the InfoNCE loss is:

L_con = −log [ exp(sim(x_amb, x_act)/τ) / ( exp(sim(x_amb, x_act)/τ) + Σ_{x_bkd} exp(sim(x_amb, x_bkd)/τ) ) ]

where sim(·,·) is a similarity function and τ = 0.07 is a temperature constant. In addition to the above loss functions, an L1 loss is introduced to ensure the sparsity of the salient-action attention weight a_act:

L_sparse = (1/T) · Σ_{t=1}^{T} |a_act(t)|
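The InfoNCE term can be sketched as follows; this is a NumPy illustration assuming cosine similarity and a single positive pair, with random stand-in features:

```python
import numpy as np

def info_nce(x_amb, x_act, x_bkds, tau=0.07):
    """Contrastive loss pulling the ambiguous feature toward the salient
    action feature and pushing it away from salient background features."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    pos = np.exp(cos(x_amb, x_act) / tau)
    neg = sum(np.exp(cos(x_amb, xb) / tau) for xb in x_bkds)
    return float(-np.log(pos / (pos + neg)))

rng = np.random.default_rng(0)
d = 32
x_act = rng.normal(size=d)
x_bkds = [rng.normal(size=d) for _ in range(4)]

near_act = x_act + 0.05 * rng.normal(size=d)      # ambiguous feature near the action
near_bkd = x_bkds[0] + 0.05 * rng.normal(size=d)  # ambiguous feature near background

print(info_nce(near_act, x_act, x_bkds) < info_nce(near_bkd, x_act, x_bkds))  # True
```

As expected, the loss is small when the ambiguous feature lies close to the salient action feature and large when it lies close to a background feature, which is what drives the feature refinement during training.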
Finally, all loss functions are combined to compute the total loss L_total, and the network converges through optimization training:

L_total = L_cls^F + L_cls^act + L_cls^amb + L_cls^bkd + α·L_con + β·L_sparse

where α and β are the respective loss coefficients.
In the testing phase, CAS_act models the action profile more accurately, so the video-level class scores p_act are obtained from CAS_act, and a threshold θ_cls is set to screen out the action categories c_act whose scores in p_act exceed θ_cls. Then a multi-threshold segmentation strategy is applied along the dimension of CAS_act corresponding to class c_act to obtain a large number of action nominations. The confidence score of an action nomination (t_s, t_e, c_act) is computed by contrasting the mean activation inside the nomination with that of its surrounding region, where t_s and t_e are the start and end times of the action, l_i = (t_e − t_s) is its length, and μ is a preset parameter controlling the extent of the surrounding region. Finally, a non-maximum suppression algorithm removes redundant nominations to obtain the final action localization result.
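The test-phase pipeline of thresholding followed by non-maximum suppression can be sketched in NumPy; the toy CAS values, thresholds, mean-score nomination confidence, and tIoU threshold of 0.5 are illustrative simplifications of the patent's scheme:

```python
import numpy as np

def segments_above(scores, thr):
    """Contiguous runs where scores >= thr, as (start, end) snippet indices."""
    segs, start = [], None
    for t, s in enumerate(scores):
        if s >= thr and start is None:
            start = t
        elif s < thr and start is not None:
            segs.append((start, t))
            start = None
    if start is not None:
        segs.append((start, len(scores)))
    return segs

def tiou(a, b):
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def nms(proposals, iou_thr=0.5):
    """proposals: (start, end, score); keep greedily by descending score."""
    keep = []
    for p in sorted(proposals, key=lambda p: -p[2]):
        if all(tiou(p, q) < iou_thr for q in keep):
            keep.append(p)
    return keep

cas = np.array([0.1, 0.2, 0.9, 0.95, 0.8, 0.3, 0.1, 0.7, 0.75, 0.2])
proposals = []
for thr in (0.4, 0.6, 0.8):                    # multi-threshold strategy
    for s, e in segments_above(cas, thr):
        proposals.append((s, e, float(cas[s:e].mean())))
final = nms(proposals)
print(final)
```

On this toy sequence the multi-threshold pass generates duplicate nominations for the two activated regions, and NMS collapses them to one nomination per region.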
The invention has the following advantages and beneficial effects:
1. The invention provides a weakly supervised temporal action localization method based on contrastive learning. During training, only video-level action category labels serve as supervision, and no manually annotated temporal action labels are needed, greatly reducing the consumption of manpower and material resources.
2. The invention separately models the salient actions, ambiguous actions, and salient backgrounds in the video with a multi-branch attention model, effectively separating action features from background features and markedly improving action localization accuracy on different datasets.
3. The invention designs an ambiguous-action contrastive loss that refines the video features under the guidance of the salient features, enabling the network to perceive more accurate temporal boundaries, preventing truncated localization results, and effectively improving localization precision.
4. The invention achieves better results than current mainstream action localization models without introducing a recurrent neural network into the localization network, thereby avoiding the vanishing-gradient problem of recurrent networks, reducing the computational cost, and accelerating training.
Drawings
Fig. 1 shows the network structure of the contrastive-learning-based weakly supervised temporal action localization method of the present invention.
Fig. 2 shows visualization results of an embodiment of the invention.
Detailed Description
The present invention is described in further detail below with reference to the following embodiments, which are illustrative only and not limiting; the protection scope of the present invention is not limited thereby.
The invention relates to a weakly supervised temporal action localization method based on contrastive learning. It adopts a staged training scheme for the feature extraction network and the action localization network; it models the salient actions, ambiguous actions, and salient backgrounds in the video with a multi-branch attention model, effectively separating action features from background features; and it introduces an ambiguous-action contrastive loss that refines the video features under the guidance of the salient features, so that the network perceives more accurate temporal boundaries, avoids truncated localization results, and effectively improves localization accuracy.
Fig. 1 shows the network structure of the contrastive-learning-based weakly supervised temporal action localization method of the present invention.
The overall framework of the invention mainly comprises two networks: a feature extraction network and an action localization network.
The feature extraction network adopts an I3D network pre-trained on the Kinetics dataset. The network uses a 3D Inception model as its backbone, with four pooling layers of temporal stride 2 inserted to control the parameter count of the network; this fuses temporal features while reasonably controlling the receptive field size and preventing the loss of detail information. The action localization network consists of a feature-embedding model, a classification model, and a multi-branch attention model, all built with temporal convolutional networks to better capture the temporal features of the video.
The datasets adopted by the invention are THUMOS-14 and ActivityNet-1.2. The THUMOS-14 dataset contains 20 action classes in total, with each video containing 15.4 action segments on average; all data are collected from the YouTube website. Video lengths vary from tens of seconds to tens of minutes, making it a challenging dataset for weakly supervised temporal action localization. Following the data split used by previous mainstream algorithms, the invention uses the 200 validation videos with timestamps as the training set and the 213 test videos as the test set. ActivityNet-1.2 is a large-scale temporal action localization dataset containing 100 action classes; its training set includes 4819 videos and its test set 2382 videos. On average each video contains 1.5 action segments and 36% background, a markedly lower proportion of action segments than THUMOS-14.
First, every 16 consecutive frames of the untrimmed video are grouped into a video snippet, yielding T video snippets. These are fed into the pre-trained feature extraction network to extract RGB and optical-flow features, which are concatenated to obtain the video feature X; the feature extraction network does not participate in the subsequent weakly supervised training. The video feature X is then fed into a feature-embedding model built from temporal convolutions and mapped into the feature space of the weakly supervised temporal action localization task to learn a more discriminative embedded feature X_in ∈ R^(T×s), where s is the feature dimension and T is the number of snippets.
Next, the embedded feature X_in must be used to obtain the temporal class activation sequences for localizing actions. The invention therefore designs two branches in the action localization network: a classification branch and an attention branch.
In the classification branch, a classification model built from temporal convolutions maps the embedded video feature X_in into the action-category feature space, yielding the original temporal class activation sequence F ∈ R^(T×(c+1)), which represents the probability distribution of actions over time; c is the number of action categories, and the (c+1)-th dimension corresponds to the background category.
However, with only the original temporal class activation sequence F, it is difficult for the network to separate the salient actions and salient backgrounds in the video. To let the network separate salient background segments from salient action segments and detect ambiguous action segments, a three-branch attention model is designed based on temporal convolutions to model salient action, salient background, and ambiguous action respectively; the attention weights of the three branches are obtained, and a Softmax function normalizes the output of the attention model. The output of the model is the attention weight Att ∈ R^(3×T), whose rows a_act, a_amb, and a_bkd correspond to the probability distributions over time of salient action, ambiguous action, and salient background, respectively. The specific process is:

Att = Softmax(Conv(X_in, θ_att))
where θ_att are the trainable parameters of the attention model. To distinguish the salient action, ambiguous action, and salient background in the video features, the three attention weights are combined with the original temporal class activation sequence F to construct the corresponding sequences CAS_act, CAS_amb, and CAS_bkd. CAS_act can be formulated as:

CAS_act = a_act * F

The sequences CAS_amb = a_amb * F and CAS_bkd = a_bkd * F, describing the ambiguous action and the salient background, are obtained in the same way. CAS_act has high activation values at the temporal positions of salient actions in the video and is suppressed at the temporal positions of salient background; CAS_bkd has high activation values at the temporal positions of salient background. Based on CAS_act and CAS_bkd, the network can thus separate the salient actions from the salient background in the video. CAS_amb has high activation values at the temporal positions of both salient and ambiguous actions.
The invention aggregates the salient features in the video through a multiple-instance learning mechanism to supervise the training of the network. The whole untrimmed video is regarded as a multi-instance bag and each video snippet as an instance, whose class activation value is obtained by the preceding method. To evaluate the loss of each temporal class activation sequence, video-level action class scores are obtained by top-k pooling of the snippet class activations. Taking F as an example:

v_j = (1/k) · Σ_{l ∈ I_j} F(l, j)

where I_j ⊂ {1, 2, ..., T} is the index set of the k largest activations of class j, |I_j| = k = max(1, T//r), and r is a preset parameter. A Softmax is finally applied over the class dimension to obtain the video-level action class scores, and the classification loss is computed with a cross-entropy function:

L_cls^F = −Σ_{j=1}^{c+1} y_j log(p_j)

where p_j is the probability that the video contains action j and L_cls^F is the classification loss of the original temporal class activation sequence. Similarly, classification losses L_cls^act, L_cls^amb, and L_cls^bkd are obtained for CAS_act, CAS_amb, and CAS_bkd. Since the invention regards the whole untrimmed video as a multi-instance bag containing both actions and background, the class labels of the original temporal class activation sequence are set to y_j = 1 and y_{c+1} = 1. Second, to guarantee that the attentions a_act and a_bkd corresponding to CAS_act and CAS_bkd attend respectively to the salient actions and the salient background in the video, their class labels are set to y_j = 1, y_{c+1} = 0 and y_j = 0, y_{c+1} = 1, respectively. In addition, to localize ambiguous actions in the video, the class labels of CAS_amb are set to y_j = 1 and y_{c+1} = 1, so that a_amb can simultaneously attend to the salient actions with high activation values and the ambiguous actions with relatively low activation values.
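The four label configurations above can be sketched together in NumPy; the random T-CAS and attention values, the choice r = 8, and the assumption that class 3 is the video's true action are illustrative, not from the patent:

```python
import numpy as np

rng = np.random.default_rng(1)
T, c = 24, 20
j = 3                                     # hypothetical ground-truth action class

def video_loss(cas, y, r=8):
    k = max(1, T // r)
    v = np.sort(cas, axis=0)[T - k:].mean(axis=0)   # top-k pooling per class
    p = np.exp(v - v.max())
    p /= p.sum()
    return float(-(y * np.log(p + 1e-12)).sum())

F = rng.normal(size=(T, c + 1))
att = rng.dirichlet(np.ones(3), size=T).T           # (3, T), rows sum to 1 per snippet
CAS = {'orig': F,
       'act': att[0][:, None] * F,
       'amb': att[1][:, None] * F,
       'bkd': att[2][:, None] * F}

# Label settings from the text: the background bit y_{c+1} differs per branch.
labels = {k: np.zeros(c + 1) for k in CAS}
labels['orig'][[j, c]] = 1   # action + background both present
labels['act'][j] = 1         # action only
labels['amb'][[j, c]] = 1    # action + background (ambiguous branch)
labels['bkd'][c] = 1         # background only

losses = {k: video_loss(CAS[k], labels[k]) for k in CAS}
total_cls = sum(losses.values())
print(sorted(losses))  # ['act', 'amb', 'bkd', 'orig']
```

The sum `total_cls` corresponds to the four classification terms that enter the total loss alongside the contrastive and sparsity terms.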
Although the above process can separate action from background with the multi-branch attention model, the network lacks the guidance of temporal action-extent information; it is difficult to directly localize ambiguous action segments in complex untrimmed videos, and the completeness of the localization result cannot be guaranteed. Ambiguous action segments, however, tend to be temporally adjacent to salient action segments and far from salient background segments; moreover, their attention weights are slightly lower than the salient-action attention weights but significantly greater than the salient-background attention weights. Based on this observation, the invention provides a simple and effective method for localizing ambiguous action segments in videos and designs an ambiguous-action contrastive loss to refine the video features, so that the network can localize more complete actions. First, according to the salient-action attention a_act, top-k pooling over the embedded feature X_in captures the salient action feature x_act:

x_act = (1/k_act) · Σ_{l ∈ topk(k_act, a_act)} X_in(l)

where k_act = max(1, T//r_act) is a hyperparameter, r_act is a preset parameter controlling the sampling rate of the salient action features, and topk(k, a) returns the time indices of the k largest values of a. The salient background feature x_bkd can be obtained by the same method.
Its parameters are analogous to those of x_act, with k_bkd = max(1, T//r_bkd) controlling the sampling rate of the salient background features. Because the attention weight a_amb attends to both salient and ambiguous actions, the ambiguous action features are difficult to acquire directly, and the salient-action weights are slightly larger than the ambiguous-action weights. Therefore, the time indices corresponding to the salient action and salient background features are first removed from a_amb, and the ambiguous action feature is pooled from the top k_amb = max(1, T//r_amb) of the remaining weights:

x_amb = (1/k_amb) · Σ_{l ∈ topk(k_amb, â_amb)} X_in(l)

where â_amb denotes a_amb with the salient action and salient background indices removed.
Finally, an InfoNCE loss is applied at the video-snippet level to compute the ambiguous-action contrastive loss and refine the ambiguous action features. Positive sample pairs are built from the salient action features and the ambiguous action features, and negative sample pairs from the salient background features and the ambiguous action features, driving the salient and ambiguous actions closer together in the feature space while pushing the salient background and the ambiguous actions apart. Given the selected ambiguous action feature x_amb, salient action feature x_act, and salient background features x_bkd, the InfoNCE loss is:

L_con = −log [ exp(sim(x_amb, x_act)/τ) / ( exp(sim(x_amb, x_act)/τ) + Σ_{x_bkd} exp(sim(x_amb, x_bkd)/τ) ) ]

where sim(·,·) is a similarity function, k_bkd is a hyperparameter controlling the sampling rate of the salient background feature x_bkd, and τ = 0.07 is a temperature constant. This loss maximizes the mutual information between the salient action segments and the ambiguous action segments. During each training iteration the network therefore keeps discovering new ambiguous action features and contrasting them with the salient features, enriching the feature information within the true action extent, improving the discriminability of the feature distribution, and capturing the complete course of an action. In addition to the above loss functions, an L1 loss is introduced to ensure the sparsity of the salient-action attention weight a_act:

L_sparse = (1/T) · Σ_{t=1}^{T} |a_act(t)|
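The index-exclusion step that isolates the ambiguous feature can be sketched in NumPy; the dimensions, the Dirichlet stand-in for the attention weights, and the shared sampling rate T//8 are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
T, s = 30, 64
X_in = rng.normal(size=(T, s))                       # embedded snippet features
a_act, a_amb, a_bkd = rng.dirichlet(np.ones(3), size=T).T

def topk_idx(a, k):
    """Time indices of the k largest attention values."""
    return set(np.argsort(a)[-k:].tolist())

k_act = k_bkd = k_amb = max(1, T // 8)
idx_act = topk_idx(a_act, k_act)
idx_bkd = topk_idx(a_bkd, k_bkd)

# Mask out snippets already claimed by salient action / background, then
# take the top-k_amb of the remaining ambiguous-attention weights.
masked = a_amb.copy()
masked[list(idx_act | idx_bkd)] = -np.inf
idx_amb = topk_idx(masked, k_amb)

x_act = X_in[list(idx_act)].mean(axis=0)             # pooled salient action feature
x_bkd = X_in[list(idx_bkd)].mean(axis=0)             # pooled salient background feature
x_amb = X_in[list(idx_amb)].mean(axis=0)             # pooled ambiguous action feature
print(len(idx_amb & (idx_act | idx_bkd)))  # 0
```

By construction the ambiguous indices never overlap the salient action or background indices, which is exactly what the removal step above requires before the contrastive loss is computed.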
Finally, all loss functions are combined to compute the total loss L_total, and the network converges through training with the Adam optimizer:

L_total = L_cls^F + L_cls^act + L_cls^amb + L_cls^bkd + α·L_con + β·L_sparse

where α and β are the respective loss coefficients.
In the testing phase, CAS_act models the action profile more accurately, so the video-level class scores p_act are obtained from CAS_act, and a threshold θ_cls is set to screen out the action categories c_act whose scores in p_act exceed θ_cls. Then a multi-threshold segmentation strategy is applied along the dimension of CAS_act corresponding to class c_act to obtain a large number of action nominations. The confidence score of an action nomination (t_s, t_e, c_act) is computed by contrasting the mean activation inside the nomination with that of its surrounding region, where t_s and t_e are the start and end times of the action, l_i = (t_e − t_s) is its length, and μ is a preset parameter controlling the extent of the surrounding region. Finally, a non-maximum suppression algorithm removes redundant nominations to obtain the final action localization result.
The invention conducts experiments with the PyTorch deep learning framework; the specific parameters are shown in Table 1 below:
TABLE 1
The model was trained to convergence and evaluated on the THUMOS-14 and ActivityNet-1.2 datasets. The evaluation results are shown in Tables 2 and 3, respectively, from which it can be seen that the action localization accuracy of the method on both datasets exceeds previous mainstream methods.
TABLE 2
TABLE 3
FIG. 2 compares the visualization results of the method of the present invention against the previous best method, HAM-Net. (a) The action corresponds to an athlete's weight-lifting process. In the two stages of picking the barbell up from the ground (frame [1]) and lifting it overhead (frame [4]), the motion amplitude is large and the action features are salient; there are also obvious scene cuts in the background (frame [5]); these are easily localized by the baseline method. However, in the middle of the lift the athlete raises the barbell and pauses at chest height (frame [2]), and there is an obvious shot transition (frame [3]). Without temporal supervision this stage is hard to capture, yet the method of the invention localizes it completely. (b) The video contains several golf swings, and the third action segment is played entirely in slow motion; the baseline method detects only part of the swing. Players tend to pause with effort when the club is at its highest and lowest points (frames [1], [3], [5]), and because of the slow motion the action features at these temporal positions are blurred and difficult to distinguish from the static background. The localization results show that the method solves this problem: it localizes the third action completely without affecting the localization of the other actions, fully demonstrating its effectiveness.
The above description covers only preferred embodiments of the present invention; the protection scope of the present invention is not limited thereto, and any substitution or modification of the technical solution and inventive concept of the present invention by a person skilled in the art within the scope of the present invention shall fall within its protection scope.
Claims (8)
1. A weakly supervised temporal action localization method based on contrastive learning, characterized by comprising the following steps:
1) constructing a feature extraction network and an action localization network, the action localization network comprising two branches corresponding respectively to a classification model and a multi-branch attention model;
2) constructing a staged weakly supervised training scheme in which the network learns only under the supervision of video-level action category labels: processing the original video sequence, sending the RGB data and optical-flow data respectively into the pre-trained feature extraction network to extract features, concatenating them to obtain the video feature X, then sending X into a feature-embedding model that maps it into the feature space of the weakly supervised temporal action localization task to obtain the embedded feature X_in;
3) To embed feature XinInputting a classification model to obtain an original time domain class activation sequence F;
4) to embed feature XinInputting a multi-branch attention model to obtain a significant action attention weight aactAttention weight of fuzzy motion aambAnd a significant background attention weight abkdAnd three corresponding time domain activation sequences are constructed, namely a remarkable action time domain activation sequence CAS respectivelyactFuzzy action time domain class activation sequence CASambAnd significant background time-domain class activation sequence CASbkd(ii) a The output of the multi-branch attention model is the attention weight after normalization processing;
5) according to the attention weight after normalization processing, positive and negative sample pairs are constructed, and a fuzzy action contrast loss function L is calculatedconCombining the loss functions to calculate the total loss function LtotalAnd the network is converged by optimizing training;
6) during the test phase, the CAS sequence is activated for the time domain classactAnd performing threshold segmentation to obtain a large number of action nominations, and finally removing redundant nominations by adopting a non-maximum suppression algorithm to obtain a final action positioning result.
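Steps 2)–4) above can be sketched as a minimal PyTorch module. This is an illustrative reconstruction, not the claimed architecture: the layer sizes, the softmax normalization across the three attention branches, and the element-wise modulation of the class activation sequence by the attention weights are all assumptions.

```python
import torch
import torch.nn as nn

class LocalizationNet(nn.Module):
    """Hypothetical sketch of the two-branch action localization network:
    feature embedding -> (classification branch, 3-branch attention)."""

    def __init__(self, feat_dim=2048, emb_dim=2048, num_classes=20):
        super().__init__()
        # feature embedding model: X_in = ReLU(Conv(X))
        self.embed = nn.Sequential(
            nn.Conv1d(feat_dim, emb_dim, kernel_size=3, padding=1), nn.ReLU())
        # classification branch: original T-CAS over c action classes + background
        self.classifier = nn.Conv1d(emb_dim, num_classes + 1, kernel_size=1)
        # multi-branch attention: a_act, a_amb, a_bkd per snippet
        self.attention = nn.Conv1d(emb_dim, 3, kernel_size=1)

    def forward(self, x):                 # x: (B, D, T) concatenated RGB+flow features
        x_in = self.embed(x)
        cas = self.classifier(x_in)       # original temporal class activation sequence F
        attn = torch.softmax(self.attention(x_in), dim=1)  # normalized attention weights
        a_act, a_amb, a_bkd = attn[:, 0:1], attn[:, 1:2], attn[:, 2:3]
        # attention-modulated class activation sequences (assumed construction)
        cas_act = cas * a_act
        cas_amb = cas * a_amb
        cas_bkd = cas * a_bkd
        return cas, cas_act, cas_amb, cas_bkd, attn
```

A forward pass on a batch of snippet features returns the original T-CAS plus the three attention-modulated sequences used in the subsequent losses.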
2. The weakly supervised temporal action localization method based on contrastive learning of claim 1, wherein: the feature extraction network in step 1) is an I3D network pre-trained on the Kinetics dataset; this I3D network does not participate in the subsequent weakly supervised training; and the classification model and the multi-branch attention model are both built from temporal convolutional networks.
3. The weakly supervised temporal action localization method based on contrastive learning of claim 1, wherein: the pre-trained feature extraction network in step 2) is an I3D network, and the embedded features X_in are computed as:

X_in = ReLU(Conv(X, θ_emb))
4. The weakly supervised temporal action localization method based on contrastive learning of claim 1, wherein in step 3):

F = Conv(X_in, θ_cls)

where θ_cls are trainable classification-model parameters.
6. The weakly supervised temporal action localization method based on contrastive learning of claim 1, wherein: the loss function of the original temporal class activation sequence F is:

where the video-level class score is the probability that the video contains action j, l ∈ {1, 2, ..., T}, |l| = k = max(1, T // r), r is a preset parameter, and j = 1, 2, ..., c + 1;

the loss function of the salient-action temporal class activation sequence CAS_act is:

the loss function of the ambiguous-action temporal class activation sequence CAS_amb is:

the loss function of the salient-background temporal class activation sequence CAS_bkd is:
7. The weakly supervised temporal action localization method based on contrastive learning of claim 1, wherein in step 5):
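The ambiguous-action contrastive loss L_con of step 5) is not reproduced in this text. The sketch below shows one plausible InfoNCE-style formulation in which an attention-weighted prototype of the ambiguous-action snippets is pulled toward the salient-action prototype and pushed away from the salient-background prototype; the prototype aggregation, the temperature τ, and the function name are all assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def ambiguous_contrast_loss(feats, a_act, a_amb, a_bkd, tau=0.07):
    """Hypothetical sketch of the ambiguous-action contrastive loss L_con.
    feats: (T, D) embedded snippet features of one video
    a_*:   (T,)  normalized attention weights for the three branches
    """
    # attention-weighted prototype features (assumed aggregation scheme)
    z_act = F.normalize((a_act.unsqueeze(1) * feats).sum(0), dim=0)
    z_bkd = F.normalize((a_bkd.unsqueeze(1) * feats).sum(0), dim=0)
    z_amb = F.normalize((a_amb.unsqueeze(1) * feats).sum(0), dim=0)
    # positive pair: ambiguous vs. salient action; negative pair: ambiguous vs. background
    pos = torch.exp(z_amb @ z_act / tau)
    neg = torch.exp(z_amb @ z_bkd / tau)
    return -torch.log(pos / (pos + neg))
```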
8. The weakly supervised temporal action localization method based on contrastive learning of claim 1, wherein the specific procedure of step 6) is as follows: in the test phase, obtain the video-level class scores p_act from CAS_act and set a threshold θ_cls; select from p_act the action categories c_act whose scores exceed θ_cls; then apply a multi-threshold segmentation strategy to the dimensions of CAS_act corresponding to class c_act to obtain a large number of action proposals; the confidence score of an action proposal (t_s, t_e, c_act) is computed by the following formula,

where t_s and t_e are respectively the start and end times of the action, l_i = (t_e − t_s)/4, and μ is a preset parameter; finally, a non-maximum suppression algorithm is applied to remove redundant proposals to obtain the final action localization result.
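The test-phase procedure of claim 8 (multi-threshold segmentation of the class activation track, then temporal non-maximum suppression) can be sketched as follows. Note that this sketch scores each proposal by its mean activation rather than by the patent's inner-outer confidence formula with μ and l_i, and the threshold values are illustrative.

```python
import numpy as np

def generate_proposals(cas_act, thresholds=(0.1, 0.2, 0.3, 0.4, 0.5)):
    """Multi-threshold segmentation of a 1-D activation track into
    (start, end, score) proposals; score is the mean activation (assumed)."""
    proposals = []
    T = len(cas_act)
    for th in thresholds:
        above = cas_act > th
        t = 0
        while t < T:
            if above[t]:
                s = t
                while t < T and above[t]:
                    t += 1
                proposals.append((s, t, float(cas_act[s:t].mean())))
            else:
                t += 1
    return proposals

def nms(proposals, iou_thresh=0.5):
    """Temporal non-maximum suppression: keep highest-scoring proposals,
    drop those overlapping a kept proposal beyond the IoU threshold."""
    proposals = sorted(proposals, key=lambda p: p[2], reverse=True)
    kept = []
    for s, e, sc in proposals:
        ok = True
        for ks, ke, _ in kept:
            inter = max(0, min(e, ke) - max(s, ks))
            union = max(e, ke) - min(s, ks)
            if union > 0 and inter / union > iou_thresh:
                ok = False
                break
        if ok:
            kept.append((s, e, sc))
    return kept
```

Running `nms(generate_proposals(track))` on a per-class activation track yields the final, de-duplicated action localizations.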
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111610682.4A CN114494941A (en) | 2021-12-27 | 2021-12-27 | Comparison learning-based weak supervision time sequence action positioning method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114494941A true CN114494941A (en) | 2022-05-13 |
Family
ID=81495834
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111610682.4A Pending CN114494941A (en) | 2021-12-27 | 2021-12-27 | Comparison learning-based weak supervision time sequence action positioning method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114494941A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116030538A (en) * | 2023-03-30 | 2023-04-28 | 中国科学技术大学 | Weak supervision action detection method, system, equipment and storage medium |
CN116503959A (en) * | 2023-06-30 | 2023-07-28 | 山东省人工智能研究院 | Weak supervision time sequence action positioning method and system based on uncertainty perception |
CN116503959B (en) * | 2023-06-30 | 2023-09-08 | 山东省人工智能研究院 | Weak supervision time sequence action positioning method and system based on uncertainty perception |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||