CN116563953B - Bottom-up weakly supervised temporal action detection method, system, device, and medium

Publication number: CN116563953B (granted patent); application number CN202310830419.9A; application publication CN116563953A
Authority: CN (China)
Legal status: Active (granted)
Filing and priority date: 2023-07-07
Inventors: 王子磊, 刘钦颖
Applicant and assignee: University of Science and Technology of China (USTC)
Original language: Chinese (zh)

Classifications

  • G06V 40/20 - Recognition of biometric, human-related or animal-related patterns in image or video data; movements or behaviour, e.g. gesture recognition
  • G06N 3/0895 - Computing arrangements based on biological models; neural networks; learning methods; weakly supervised learning, e.g. semi-supervised or self-supervised learning
  • G06V 10/761 - Image or video pattern matching; proximity, similarity or dissimilarity measures in feature spaces
  • G06V 10/762 - Image or video recognition using pattern recognition or machine learning; clustering, e.g. of similar faces in social networks
  • G06V 10/764 - Image or video recognition using pattern recognition or machine learning; classification, e.g. of video objects
  • G06V 10/82 - Image or video recognition using pattern recognition or machine learning; neural networks
  • G06V 20/46 - Scenes; scene-specific elements in video content; extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Abstract

The invention discloses a bottom-up weakly supervised temporal action detection method, system, device, and medium, which form a set of mutually corresponding schemes. Video frames are first grouped by frame-level clustering so that their distribution can be analyzed accurately, and the resulting clusters are then classified into foreground and background by a cluster-level classification, which indirectly separates foreground frames from background frames. This bottom-up modeling brings finer-grained supervision signals to video temporal action detection, reducing the dependence on video-level supervision signals while still achieving accurate foreground-background separation. Combined with the class activation sequence and the frame-level attention weights produced by a weakly supervised video temporal action detection model, temporal actions in video can be detected accurately.

Description

Bottom-up weakly supervised temporal action detection method, system, device, and medium
Technical Field
The invention relates to the field of video action detection, and in particular to a bottom-up weakly supervised temporal action detection method, system, device, and medium.
Background
In recent years, with the rapid development of video surveillance and artificial intelligence technologies, video surveillance systems have been widely deployed in security, traffic, medical, and other fields. Temporal action detection is one of the important tasks in a video surveillance system: it monitors and identifies human behavior in video data in real time, such as pedestrians walking, vehicles driving, or medical rehabilitation exercises. However, conventional temporal action detection methods require manual frame-level annotation of videos and often depend on specialized domain knowledge and complex algorithmic models, which limits their research and application scope.
To address these problems, the weakly supervised video temporal action detection task has become a research hotspot. This task trains and optimizes a model using only the action labels of each video, without frame-level annotation. Because only whole videos are labeled, existing research learns a class activation sequence by optimizing a video classification model and uses it as a location cue to generate detection results. However, there is an inherent contradiction between the optimization targets of the classification and detection tasks: a classification task attends only to a few salient action frames, which leads to inaccurate foreground-background separation, i.e., a failure to accurately distinguish action frames from non-action frames. For example, Chinese patent CN110516536B discloses a weakly supervised video temporal action detection method based on complementary temporal class activation patterns: it trains a video classification network to generate a class activation sequence, feeds the remaining non-salient video frames into the network again to obtain a complementary class activation sequence, and finally combines the two sequences into a more complete one. However, a video classification model alone cannot guarantee that foreground and background frames are accurately distinguished among the remaining non-salient frames, so the accuracy of temporal action detection still needs improvement.
Disclosure of Invention
The invention aims to provide a bottom-up weakly supervised temporal action detection method, system, device, and medium that reduce the dependence on video-level supervision signals while still achieving accurate foreground-background separation, thereby improving the accuracy of video temporal action detection.
The aim of the invention is achieved by the following technical scheme:
A bottom-up weakly supervised temporal action detection method comprises the following steps:
step 1, performing feature mapping on a frame-level feature sequence of a video to be detected through a weakly supervised video temporal action detection model, and generating a class activation sequence and frame-level attention weights respectively;
step 2, performing frame-level clustering on the frame-level feature sequence obtained by the feature mapping to obtain a frame-level cluster assignment probability;
step 3, using the frame-level feature sequence obtained by the feature mapping, combined with the frame-level cluster assignment probability and the frame-level attention weights, calculating the similarity between the cluster prototypes and the foreground/background prototypes to obtain a cluster-level classification probability;
step 4, calculating a frame-level foreground probability from the frame-level cluster assignment probability and the cluster-level classification probability, and calculating an action detection result in combination with the class activation sequence and the frame-level attention weights.
A bottom-up weakly supervised temporal action detection system comprises:
a weakly supervised video temporal action detection model, used to perform feature mapping on the frame-level feature sequence of a video to be detected and to generate a class activation sequence and frame-level attention weights respectively;
a frame-level clustering module, used to perform frame-level clustering on the frame-level feature sequence obtained by the feature mapping to obtain a frame-level cluster assignment probability;
a cluster-level classification module, used to calculate, from the frame-level feature sequence obtained by the feature mapping, combined with the frame-level cluster assignment probability and the frame-level attention weights, the similarity between the cluster prototypes and the foreground/background prototypes to obtain a cluster-level classification probability;
an action detection module, used to calculate a frame-level foreground probability from the frame-level cluster assignment probability and the cluster-level classification probability, and to calculate an action detection result in combination with the class activation sequence and the frame-level attention weights.
A processing apparatus, comprising: one or more processors; a memory for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the aforementioned methods.
A readable storage medium storing a computer program which, when executed by a processor, implements the method described above.
According to the technical scheme provided by the invention, the distribution of video frames is analyzed accurately through frame-level clustering, and the resulting clusters are then classified into foreground and background by a cluster-level classification, which indirectly separates foreground frames from background frames. This bottom-up modeling brings finer-grained supervision signals to video temporal action detection, reducing the dependence on video-level supervision signals while still achieving accurate foreground-background separation; combined with the class activation sequence and the frame-level attention weights produced by the weakly supervised video temporal action detection model, video temporal actions can be detected accurately.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained from them by a person skilled in the art without inventive effort.
FIG. 1 is a flow chart of a bottom-up weakly supervised temporal action detection method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the core idea of the bottom-up weakly supervised temporal action detection method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a network model according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a bottom-up weakly supervised temporal action detection system according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a processing device according to an embodiment of the present invention.
Detailed Description
The embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, embodiments of the invention. All other embodiments obtained by a person skilled in the art based on the embodiments of the invention without inventive effort fall within the scope of the invention.
The terms that may be used herein will first be described as follows:
the terms "comprises," "comprising," "includes," "including," "has," "having" or other similar referents are to be construed to cover a non-exclusive inclusion. For example: including a particular feature (e.g., a starting material, component, ingredient, carrier, formulation, material, dimension, part, means, mechanism, apparatus, step, procedure, method, reaction condition, processing condition, parameter, algorithm, signal, data, product or article of manufacture, etc.), should be construed as including not only a particular feature but also other features known in the art that are not explicitly recited.
The term "consisting of … …" is meant to exclude any technical feature element not explicitly listed. If such term is used in a claim, the term will cause the claim to be closed, such that it does not include technical features other than those specifically listed, except for conventional impurities associated therewith. If the term is intended to appear in only a clause of a claim, it is intended to limit only the elements explicitly recited in that clause, and the elements recited in other clauses are not excluded from the overall claim.
The bottom-up weakly supervised temporal action detection method, system, device, and medium provided by the invention are described in detail below. Details not described in the embodiments of the present invention belong to the prior art known to those skilled in the art. Where specific conditions are not noted in the examples, they follow the conventional conditions in the art or the conditions suggested by the manufacturer.
Embodiment 1
The embodiment of the invention provides a bottom-up weakly supervised temporal action detection method which, as shown in FIG. 1, mainly comprises the following steps:
Step 1, performing feature mapping on a frame-level feature sequence of a video to be detected through a weakly supervised video temporal action detection model, and generating a class activation sequence and frame-level attention weights respectively.
In the embodiment of the invention, the weakly supervised video temporal action detection model can be implemented following a conventional scheme and comprises two branches: the first branch (the video classification branch) generates the class activation sequence, and the second branch (the attention branch) generates the frame-level attention weights.
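For illustration, the following is a minimal PyTorch sketch of what such a two-branch base model might look like; the layer types, sizes, and names are assumptions made for this sketch, not the patent's exact design.

```python
import torch
import torch.nn as nn

class BaseModel(nn.Module):
    """Two-branch base model: a video classification branch producing a class
    activation sequence (CAS) and an attention branch producing frame-level
    attention weights."""
    def __init__(self, feat_dim=2048, embed_dim=2048, num_classes=20):
        super().__init__()
        # Each branch has its own feature mapper (here, a temporal convolution).
        self.cls_mapper = nn.Sequential(
            nn.Conv1d(feat_dim, embed_dim, 3, padding=1), nn.ReLU())
        self.att_mapper = nn.Sequential(
            nn.Conv1d(feat_dim, embed_dim, 3, padding=1), nn.ReLU())
        self.classifier = nn.Conv1d(embed_dim, num_classes, 1)  # CAS head
        self.attention = nn.Conv1d(embed_dim, 1, 1)             # single-channel attention layer

    def forward(self, x):                # x: (B, T, feat_dim) frame-level features
        x = x.transpose(1, 2)            # (B, feat_dim, T) for Conv1d
        cas = self.classifier(self.cls_mapper(x)).transpose(1, 2)  # (B, T, C)
        feat = self.att_mapper(x)                                   # (B, embed_dim, T)
        att = torch.sigmoid(self.attention(feat)).squeeze(1)       # (B, T), in [0, 1]
        # The mapped attention-branch features also feed the later modules.
        return cas, att, feat.transpose(1, 2)
```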
Step 2, performing frame-level clustering on the frame-level feature sequence obtained by the feature mapping to obtain a frame-level cluster assignment probability.
In the embodiment of the invention, the frame-level feature sequence obtained by the feature mapping is clustered at the frame level by a K-class cluster classifier to obtain the K-dimensional frame-level cluster assignment probability $\mathbf{A}\in\mathbb{R}^{N\times K}$, whose element $a_{n,k}$ in the n-th row and k-th column represents the predicted probability that the n-th frame is assigned to the k-th cluster, $k=1,\dots,K$, where K is the number of clusters.
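Continuing the sketch above (and under the same assumptions), the K-class cluster classifier can be a single linear layer followed by a softmax over the K clusters:

```python
import torch
import torch.nn as nn

K = 8                                            # number of latent clusters (hyper-parameter)
cluster_head = nn.Linear(2048, K)                # K-class linear cluster classifier
feat = torch.randn(750, 2048)                    # (N, D) mapped frame-level features
A = torch.softmax(cluster_head(feat), dim=-1)    # (N, K) cluster assignment probabilities
```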
Step 3, using the frame-level feature sequence obtained by the feature mapping, combined with the frame-level cluster assignment probability and the frame-level attention weights, calculating the similarity between the cluster prototypes and the foreground/background prototypes to obtain a cluster-level classification probability.
In the embodiment of the invention, a prototype of each cluster is calculated by combining the frame-level feature sequence obtained by the feature mapping with the frame-level cluster assignment probability; taking the frame-level attention weights as the frame-level foreground probability, a foreground prototype and a background prototype are calculated in combination with the frame-level feature sequence obtained by the feature mapping; the similarity between each cluster prototype and the foreground and background prototypes is then calculated to obtain the cluster-level classification probability, i.e., the probability that each cluster belongs to the foreground or the background.
In the embodiment of the present invention, the frame-level feature sequences obtained by the feature mapping that are used in step 2 and step 3 both come from the attention branch.
Step 4, calculating a frame-level foreground probability from the frame-level cluster assignment probability and the cluster-level classification probability, and calculating an action detection result in combination with the class activation sequence and the frame-level attention weights.
The preferred implementation of this step is as follows:
(1) Calculate the frame-level foreground probability from the frame-level cluster assignment probability and the cluster-level classification probability.
(2) Perform a weighted summation of the class activation sequence, the frame-level attention weights, and the frame-level foreground probability to obtain a modulated class activation sequence.
(3) Set multiple thresholds on the modulated class activation sequence to generate multiple action proposals.
(4) Remove redundant action proposals through non-maximum suppression; the remaining action proposals form the action detection result.
According to the scheme provided by the embodiment of the invention, the distribution of video frames is analyzed accurately through frame-level clustering, and the clusters are then classified into foreground and background by a cluster-level classification, indirectly separating foreground frames from background frames. This bottom-up modeling brings finer-grained supervision signals to video temporal action detection, reducing the dependence on video-level supervision signals while still achieving accurate foreground-background separation; combined with the class activation sequence and the frame-level attention weights obtained by the weakly supervised video temporal action detection model, video temporal action detection can be realized accurately.
In order to demonstrate the technical scheme and effects of the invention more clearly, the method provided by the embodiment of the invention is described in detail below through a specific implementation.
1. Constructing the network model.
In the embodiment of the invention, to address the difficulty of separating foreground from background caused by the difference between the classification and detection tasks in weakly supervised video temporal action detection, a network model comprising a base model, a frame-level clustering module, and a cluster-level classification module is constructed. The base model is a video classification model trained in the conventional way, such as the weakly supervised video temporal action detection model described above. The frame-level clustering module (executing step 2 above) groups video frames into several latent clusters in a self-supervised manner, thereby describing the distribution of frames in the feature space more accurately. The cluster-level classification module (executing step 3 above) classifies each cluster into foreground or background in a self-supervised manner, thereby learning the mapping between each cluster and the foreground/background categories. After training, the invention realizes foreground-background frame separation in a bottom-up manner: the frame-level cluster probabilities are converted into frame-level foreground-background prediction probabilities by combining the predictions of the frame-level clustering module and the cluster-level classification module, and the action detection result is then calculated in combination with the class activation sequence generated by the base model and the frame-level attention weights.
FIG. 2 illustrates the core idea of the invention: each frame is first assigned to one of several clusters by frame-level clustering, and each cluster is then classified into foreground or background. Through this two-step cascade, the foreground-background separation of frames is achieved in a bottom-up manner.
FIG. 3 illustrates the overall framework of the network model. The base model contains two branches: a video classification branch and an attention branch. The video classification branch trains a video classification network to learn the class activation sequence, while the attention branch predicts the attention weight of each frame, i.e., the foreground probability of each frame. However, this base model has two problems: 1) the video classification branch attends only to salient foreground frames and cannot fully mine the distribution of all frames; 2) the attention branch simply divides frames into two classes and fails to adequately characterize the intra-class distributions of the foreground and the background. The invention introduces the frame-level clustering module and the cluster-level classification module to solve these problems: the frame-level clustering module predicts the cluster assignment probability of each frame, and the cluster-level classification module further predicts the probability that each cluster belongs to the foreground or the background. By combining the frame-level cluster assignment probabilities with the cluster-level foreground-background classification probabilities, the frame-level foreground-background probabilities can easily be deduced and used to distinguish foreground from background frames at test time.
2. Training the network model.
1. Base model.
As previously described, the base model may use a weakly supervised video temporal action detection model comprising a video classification branch and an attention branch. During training, frame-level features are first extracted from the video to form a frame-level feature sequence, which is then input to the base model. Both branches contain a feature mapper. The feature mapper of the video classification branch is followed by a video-level classifier that generates the class activation sequence; the sequence is aggregated into a video-level classification result, from which the video-level classification loss is calculated. The feature mapper of the attention branch is followed by an attention layer that generates the frame-level attention weights, from which the attention weight constraint loss is calculated. The overall loss of the base model is formed by the video-level classification loss and the attention weight constraint loss.
In the embodiment of the invention, the training data carry video-level action labels, so the video-level classification loss can be calculated directly from the labels and the video-level classification result.
In the embodiment of the invention, the attention weight constraint loss is calculated as follows: the video frames are sorted by frame-level attention weight from large to small, the labels of the top r% of frames are set to 1, and the labels of the remaining frames are set to 0, where r is a predefined ratio, typically set to 50. The attention weight constraint loss is then calculated from the frame-level attention weights and the labels set as described above.
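A minimal sketch of this constraint follows, assuming a binary cross-entropy formulation; the patent specifies the top-r% labeling rule but not the exact loss function, so the loss choice here is an assumption.

```python
import torch
import torch.nn.functional as F

def attention_constraint_loss(att: torch.Tensor, r: float = 50.0) -> torch.Tensor:
    """att: (T,) frame-level attention weights in [0, 1]."""
    k = max(1, int(att.numel() * r / 100))     # number of frames labeled foreground
    idx = torch.argsort(att, descending=True)  # sort frames by attention weight
    target = torch.zeros_like(att)
    target[idx[:k]] = 1.0                      # top r% of frames -> label 1, rest -> 0
    return F.binary_cross_entropy(att, target)
```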
2. Frame-level clustering module.
The frame-level clustering module predicts the cluster assignment of each frame with a learnable K-class linear classifier (the cluster classifier), thereby obtaining the K-dimensional cluster assignment probability of each frame. Because no ground-truth label for the cluster assignment exists, the invention adopts an optimal-transport-based pseudo-label technique to generate pseudo labels for the cluster assignment probabilities, which can then supervise their optimization.
Specifically, the invention first places a K-class cluster classifier on the attention branch of the base model, where K is a preset hyper-parameter representing the number of latent clusters. During each training iteration, the feature sequence of N frames $\mathbf{X}\in\mathbb{R}^{N\times D}$ (the frame-level feature sequence obtained by the feature mapping, where $\mathbb{R}$ denotes the set of real numbers and D is the feature dimension) is input to the K-class cluster classifier to obtain the K-dimensional frame-level cluster assignment probability $\mathbf{A}\in\mathbb{R}^{N\times K}$. The number of rows of $\mathbf{A}$ equals the total number of frames N and the number of columns equals the number of clusters K; the n-th row $\mathbf{a}_{n}$ is the frame-level cluster assignment probability of the n-th frame, each element of which represents the probability of assigning the n-th frame to the corresponding cluster. To optimize the K-dimensional frame-level cluster assignment probability $\mathbf{A}$, the invention designs an optimal-transport-based pseudo-label technique to generate the corresponding pseudo label for $\mathbf{A}$. Pseudo-label generation can be regarded as an optimal transport problem: the labeling process distributes the N frame features to the K cluster centers at minimal transport cost.
In theory, solving this optimal transport problem exactly is computationally expensive, so the Sinkhorn-Knopp algorithm (an algorithm for computing optimal transport) is used to obtain an approximate solution quickly. Denoting the pseudo label of the N frames by $\mathbf{Y}\in\mathbb{R}^{N\times K}$, the optimal pseudo label $\mathbf{Y}^{*}$ takes the form
$$\mathbf{Y}^{*}=\operatorname{diag}(\mathbf{u})\,\mathbf{A}^{\lambda}\,\operatorname{diag}(\mathbf{v}),$$
where $\mathbf{A}$ is the frame-level cluster assignment probability predicted for the N frames (the power $\lambda$ is taken element-wise), diag is a function constructing a diagonal matrix, and $\lambda$ is a predefined constant controlling the convergence of the algorithm, set to 10 in the invention. $\mathbf{u}\in\mathbb{R}^{N}$ and $\mathbf{v}\in\mathbb{R}^{K}$ are two vectors that rescale the rows and columns so that $\mathbf{Y}^{*}$ is a probability distribution; they are updated iteratively by
$$\mathbf{u}\leftarrow\frac{1/N}{\mathbf{A}^{\lambda}\,\mathbf{v}},\qquad \mathbf{v}\leftarrow\frac{1/K}{\left(\mathbf{A}^{\lambda}\right)^{\top}\mathbf{u}},$$
where the divisions are element-wise (uniform marginals, i.e., equipartition of frames over clusters). After a few iterations the algorithm converges quickly, yielding the optimal $\mathbf{u}$ and $\mathbf{v}$ and hence the optimal pseudo label $\mathbf{Y}^{*}$, whose element $y_{n,k}$ in the n-th row and k-th column represents the probability that the n-th frame is assigned to the k-th cluster.
Thereafter, the frame-level cluster assignment probability $\mathbf{A}$ and its corresponding pseudo label $\mathbf{Y}$ can be used to calculate the first classification loss, which optimizes the frame-level clustering module.
In the embodiment of the invention, the first classification loss is the cross-entropy loss:
$$\mathcal{L}_{1}=-\frac{1}{N}\sum_{n=1}^{N}\sum_{k=1}^{K} y_{n,k}\log a_{n,k},$$
where $\mathcal{L}_{1}$ is the first classification loss.
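The sketch below implements the pseudo-label step and the first classification loss as described above; the uniform (equipartition) row and column marginals are an assumption consistent with common Sinkhorn-based self-labeling practice.

```python
import torch

def sinkhorn_pseudo_labels(A: torch.Tensor, lam: float = 10.0, n_iters: int = 3) -> torch.Tensor:
    """A: (N, K) predicted frame-level cluster assignment probabilities."""
    M = A.detach() ** lam                  # element-wise sharpening with the constant lambda
    N, K = M.shape
    v = torch.ones(K) / K
    for _ in range(n_iters):               # alternate row/column rescaling (Sinkhorn-Knopp)
        u = (1.0 / N) / (M @ v)
        v = (1.0 / K) / (M.t() @ u)
    Y = torch.diag(u) @ M @ torch.diag(v)
    return Y / Y.sum(dim=1, keepdim=True)  # each row becomes a distribution over clusters

def first_classification_loss(A: torch.Tensor, Y: torch.Tensor) -> torch.Tensor:
    """Cross-entropy between assignment probabilities and their pseudo labels."""
    return -(Y * torch.log(A + 1e-8)).sum(dim=1).mean()
```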
3. Cluster-level classification module.
The cluster-level classification module classifies each cluster into foreground or background. Inspired by prototype learning, the invention first computes a prototype for each cluster and prototypes for the foreground and background, and then obtains the cluster-level classification probability by computing the similarity between the cluster prototypes and the foreground/background prototypes. Specifically, a cluster prototype is obtained by aggregating the features of the frames belonging to that cluster according to the frame-level cluster assignment probability $\mathbf{A}$. For the k-th cluster, the prototype is
$$\mathbf{c}_{k}=\frac{\sum_{n=1}^{N} a_{n,k}\,\mathbf{x}_{n}}{\sum_{n=1}^{N} a_{n,k}},$$
where $\mathbf{x}_{n}$ is the frame-level feature of the n-th frame obtained by the feature mapping and $\mathbf{c}_{k}$ is the prototype of the k-th cluster. In a similar way, the foreground/background prototypes $\mathbf{p}_{i}$ can be computed, comprising the foreground prototype $\mathbf{p}_{1}$ and the background prototype $\mathbf{p}_{2}$.
The frame-level attention weights of the base model can be regarded as the frame-level foreground probabilities, so the features of frames belonging to the foreground can be aggregated with the attention weights to obtain the foreground prototype:
$$\mathbf{p}_{1}=\frac{\sum_{n=1}^{N} w_{n}\,\mathbf{x}_{n}}{\sum_{n=1}^{N} w_{n}},$$
where $w_{n}$ is the attention weight of the n-th frame, i.e., the probability that the n-th frame belongs to the foreground. Accordingly, $1-w_{n}$ represents the probability that the n-th frame belongs to the background, so the background prototype is calculated as
$$\mathbf{p}_{2}=\frac{\sum_{n=1}^{N}\,(1-w_{n})\,\mathbf{x}_{n}}{\sum_{n=1}^{N}\,(1-w_{n})}.$$
thereafter, by calculationSimilarity between prototype of each cluster and prototype of the foreground to obtain cluster classification probabilityIt contains K rows, each row containing the prediction probabilities that the corresponding cluster belongs to the foreground and the background. For the kth cluster, the probability that it belongs to the foreground and the background can be calculated by the following formula:
where i=1, 2, i=1 denotes foreground, i=2 denotes background, i.eRepresenting the predicted probability that the kth cluster belongs to either the foreground (i=1) or the background (i=1), cos is the cosine similarity function, softmax is the normalized exponential function,for the scaling factor, 10 is typically set.
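A sketch of the prototype computation and the cosine-similarity classification follows, using the notation above (X for mapped features, A for cluster assignment probabilities, w for attention weights, tau = 10):

```python
import torch
import torch.nn.functional as F

def cluster_classification(X: torch.Tensor, A: torch.Tensor, w: torch.Tensor,
                           tau: float = 10.0) -> torch.Tensor:
    """X: (N, D) mapped frame features; A: (N, K) cluster assignment
    probabilities; w: (N,) frame-level attention (foreground) weights."""
    C = (A.t() @ X) / A.sum(dim=0, keepdim=True).t()        # (K, D) cluster prototypes
    fg = (w.unsqueeze(1) * X).sum(0) / w.sum()              # foreground prototype p1
    bg = ((1 - w).unsqueeze(1) * X).sum(0) / (1 - w).sum()  # background prototype p2
    P = torch.stack([fg, bg])                               # (2, D)
    sim = F.cosine_similarity(C.unsqueeze(1), P.unsqueeze(0), dim=-1)  # (K, 2)
    return torch.softmax(tau * sim, dim=1)                  # Q: cluster-level fg/bg probability
```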
Since the training process has no ground-truth foreground/background label for each cluster, a pseudo label must be generated to supervise the learning of the cluster-level classification probability $\mathbf{Q}$ in order to optimize the cluster-level classification module. In the embodiment of the invention, similarly to the frame-level clustering module, the optimal-transport-based pseudo-label technique is again adopted to generate the pseudo label $\mathbf{Z}$ corresponding to the cluster-level classification probability $\mathbf{Q}$, i.e., a near-optimal solution of the optimal transport problem is obtained iteratively by the Sinkhorn-Knopp algorithm:
$$\mathbf{Z}^{*}=\operatorname{diag}(\mathbf{u}')\,\mathbf{Q}^{\lambda}\,\operatorname{diag}(\mathbf{v}'),$$
where $\mathbf{u}'$ and $\mathbf{v}'$ are two vectors (obtained by the iterative updates given above) that ensure the pseudo label $\mathbf{Z}^{*}$ corresponding to the cluster-level classification probability is a probability distribution. The element of $\mathbf{Z}^{*}$ in the k-th row and i-th column, denoted $z_{k,i}$, represents the probability that the k-th cluster belongs to the i-th class; as before, i = 1 refers to the foreground and i = 2 to the background.
Thereafter, the second classification loss, calculated from the cluster-level classification probability and its corresponding pseudo label, is used to optimize the cluster-level classification module.
Similarly, the second classification loss is also the cross-entropy loss:
$$\mathcal{L}_{2}=-\frac{1}{K}\sum_{k=1}^{K}\sum_{i=1}^{2} z_{k,i}\log q_{k,i},$$
where $\mathcal{L}_{2}$ is the second classification loss.
Finally, the video-level classification loss, the attention weight constraint loss, the first classification loss, and the second classification loss are aggregated into the total loss function for end-to-end joint training, so that all modules are trained collaboratively and promote one another; this training mainly optimizes the internal parameters of the base model and the frame-level clustering module (the cluster-level classification module introduces no additional learnable parameters).
3. Temporal action detection.
After training, the class activation sequence and the frame-level attention weights are generated according to the flow of FIG. 1, together with the frame-level cluster assignment probability $\mathbf{A}$ and the cluster-level classification probability $\mathbf{Q}$. The frame-level foreground-background probability $\mathbf{F}$ is then calculated by the law of total probability:
$$\mathbf{F}=\mathbf{A}\,\mathbf{Q}\in\mathbb{R}^{N\times 2},\qquad f_{n,i}=\sum_{k=1}^{K} a_{n,k}\,q_{k,i},$$
where the frame-level foreground-background probability $\mathbf{F}$ has N rows, each row containing the probabilities that the corresponding frame belongs to the foreground and to the background, i.e., the frame-level foreground probability and the frame-level background probability.
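In code, this fusion reduces to a single matrix product; the sketch below uses random inputs purely for shape illustration.

```python
import torch

A = torch.softmax(torch.randn(750, 8), dim=-1)  # (N, K) frame-level cluster probabilities
Q = torch.softmax(torch.randn(8, 2), dim=-1)    # (K, 2) cluster-level fg/bg probabilities
F_prob = A @ Q                                  # (N, 2) frame-level fg/bg probabilities
frame_fg = F_prob[:, 0]                         # column 0: frame-level foreground probability
```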
Thereafter, the frame-level foreground probability can be combined with the class activation sequence and the frame-level attention weight to calculate an action detection result.
For ease of understanding, a specific example is provided below for illustration.
In this example, weakly supervised video temporal action detection uses only video-level label supervision to localize and classify actions in video. A base model is built on an existing weakly supervised video temporal action detection scheme, and the frame-level clustering module and cluster-level classification module are then added. In this way, the method reduces the dependence on video-level supervision signals and enables more accurate foreground-background separation.
Step 1, preparing the data sets.
In this example, a training set containing long videos with video-level action annotations is required. A fully annotated test set is also prepared, containing long videos together with the class label of each frame, for computing the evaluation metrics.
Step 2, building the network model.
In this example, the network structure is built according to the framework shown in FIG. 3. Specifically, features are first extracted from the video at intervals of 16 frames using the pre-trained feature extractor I3D (Inflated 3D ConvNet). I3D is used because it performs excellently in recent action recognition work and has been widely adopted for feature extraction in video temporal action detection. The subsequent network framework mainly comprises the base model, the frame-level clustering module, and the cluster-level classification module.
In this example, the previous weakly supervised video temporal action detection framework ASL may be adopted as the base model, since it performs well in both accuracy and efficiency. Specifically, it comprises two branches: a video-level classification branch and an attention branch. The video-level classification branch comprises a feature mapper and a video classifier, whose number of categories equals the number of action categories. The attention branch comprises another feature mapper followed by a single-channel attention layer. A linear cluster classifier in the frame-level clustering module learns the frame-level cluster assignment, while the cluster-level classification module introduces no additional learnable parameters or structures.
Step 3, training the network model. During each training iteration, 16 videos are randomly drawn from the training dataset as one batch. After feature extraction with the I3D network described above, 750 frame-level features are randomly sampled from each video; these form the sequences input to the video classification branch and the attention branch of the base model. The base model is trained in a manner similar to existing schemes, optimizing the video classification branch and the attention branch by minimizing the video classification loss and the attention weight constraint loss, respectively. On the attention branch, the features produced by the feature mapper are fed to the frame-level clustering module and the cluster-level classification module. The frame-level clustering module predicts the frame-level cluster assignment probabilities through the cluster classifier; the cluster-level classification module first computes the cluster prototypes and the foreground/background prototypes from the input features, and then computes the cluster-level classification probabilities from them. The frame-level clustering module is trained with the cross-entropy loss between the frame-level cluster assignment probabilities and their pseudo labels, and the cluster-level classification module with the cross-entropy loss between the cluster-level classification probabilities and their pseudo labels. All losses are added together for joint, end-to-end training of the whole network model.
In this example, during the training phase, SGD (stochastic gradient descent) is used as the optimizer to minimize the loss function, with a learning rate of 0.0001; 200 epochs are trained in total.
Step 4, testing the network model.
A test set with complete frame-level labels is prepared, and each video is input to the network model in turn. The video classification branch of the base model generates the class activation sequence, and the attention branch generates the frame-level attention weights. The frame-level foreground probability is calculated by fusing the frame-level cluster assignment probabilities with the cluster-level classification probabilities using the scheme described above. A weighted summation of the class activation sequence, the frame-level attention weights, and the frame-level foreground probability then yields the modulated class activation sequence. Finally, as in conventional weakly supervised temporal action detection methods, multiple thresholds (e.g., 0.1, 0.2, 0.3, ..., 0.9) are set on the modulated class activation sequence to generate a series of action proposals, and redundant proposals are removed with a non-maximum suppression (NMS) algorithm to obtain the final action detection result.
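The sketch below illustrates this test-time procedure for one action class; the proposal scoring rule (mean activation inside the segment) is an illustrative assumption, since the patent does not specify how proposals are scored before NMS.

```python
import numpy as np

def generate_proposals(cas_c, thresholds=np.arange(0.1, 1.0, 0.1)):
    """cas_c: (T,) modulated class activation for one action class."""
    proposals = []
    for th in thresholds:
        mask = cas_c > th
        t = 0
        while t < len(mask):                 # collect contiguous above-threshold runs
            if mask[t]:
                s = t
                while t < len(mask) and mask[t]:
                    t += 1
                proposals.append((s, t, float(cas_c[s:t].mean())))  # (start, end, score)
            else:
                t += 1
    return proposals

def temporal_nms(proposals, iou_th=0.5):
    """Greedy non-maximum suppression over 1-D temporal segments."""
    proposals = sorted(proposals, key=lambda p: p[2], reverse=True)
    kept = []
    for s, e, score in proposals:
        # Keep a proposal only if its temporal IoU with every kept one is below iou_th.
        if all((min(e, e2) - max(s, s2)) < iou_th * (max(e, e2) - min(s, s2))
               for s2, e2, _ in kept):
            kept.append((s, e, score))
    return kept
```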
From the description of the above embodiments, it will be apparent to those skilled in the art that the above embodiments may be implemented in software, or may be implemented by means of software plus a necessary general hardware platform. With such understanding, the technical solutions of the foregoing embodiments may be embodied in a software product, where the software product may be stored in a nonvolatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.), and include several instructions for causing a computer device (may be a personal computer, a server, or a network device, etc.) to perform the methods of the embodiments of the present invention.
Embodiment 2
The invention also provides a bottom-up weakly supervised temporal action detection system, mainly used to realize the method provided by the foregoing embodiment. As shown in FIG. 4, it mainly comprises:
a weakly supervised video temporal action detection model, used to perform feature mapping on the frame-level feature sequence of a video to be detected and to generate a class activation sequence and frame-level attention weights respectively;
a frame-level clustering module, used to perform frame-level clustering on the frame-level feature sequence obtained by the feature mapping to obtain a frame-level cluster assignment probability;
a cluster-level classification module, used to calculate, from the frame-level feature sequence obtained by the feature mapping, combined with the frame-level cluster assignment probability and the frame-level attention weights, the similarity between the cluster prototypes and the foreground/background prototypes to obtain a cluster-level classification probability;
an action detection module, used to calculate a frame-level foreground probability from the frame-level cluster assignment probability and the cluster-level classification probability, and to calculate an action detection result in combination with the class activation sequence and the frame-level attention weights.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the division into the above functional modules is illustrated; in practical applications, the above functions may be allocated to different functional modules as needed, i.e., the internal structure of the system may be divided into different functional modules to perform all or part of the functions described above.
Embodiment 3
The invention also provides a processing device which, as shown in FIG. 5, mainly comprises: one or more processors; and a memory for storing one or more programs; wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method provided by the foregoing embodiments.
Further, the processing device further comprises at least one input device and at least one output device; in the processing device, the processor, the memory, the input device and the output device are connected through buses.
In the embodiment of the invention, the specific types of the memory, the input device and the output device are not limited; for example:
the input device can be a touch screen, an image acquisition device, a physical key or a mouse and the like;
the output device may be a display terminal;
the memory may be random access memory (Random Access Memory, RAM) or non-volatile memory (non-volatile memory), such as disk memory.
Embodiment 4
The invention also provides a readable storage medium storing a computer program which, when executed by a processor, implements the method provided by the foregoing embodiments.
The readable storage medium according to the embodiment of the present invention may be provided as a computer readable storage medium in the aforementioned processing apparatus, for example, as a memory in the processing apparatus. The readable storage medium may be any of various media capable of storing a program code, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a magnetic disk, and an optical disk.
The foregoing is only a preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions easily contemplated by those skilled in the art within the scope of the present invention should be included in the scope of the present invention. Therefore, the protection scope of the present invention should be subject to the protection scope of the claims.

Claims (10)

1. A bottom-up weakly supervised temporal action detection method, characterized by comprising the following steps:
step 1, performing feature mapping on a frame-level feature sequence of a video to be detected through a weakly supervised video temporal action detection model, wherein a video classification branch generates a class activation sequence from the frame-level feature sequence obtained by the feature mapping, and an attention branch generates frame-level attention weights from the frame-level feature sequence obtained by the feature mapping;
step 2, performing frame-level clustering on the frame-level feature sequence obtained by the feature mapping to obtain a frame-level cluster assignment probability;
step 3, using the frame-level feature sequence obtained by the feature mapping, combined with the frame-level cluster assignment probability and the frame-level attention weights, calculating the similarity between the cluster prototypes and the foreground/background prototypes to obtain a cluster-level classification probability;
step 4, calculating a frame-level foreground probability from the frame-level cluster assignment probability and the cluster-level classification probability, and calculating an action detection result in combination with the class activation sequence and the frame-level attention weights.
2. The bottom-up weakly supervised temporal action detection method according to claim 1, wherein performing frame-level clustering on the frame-level feature sequence obtained by the feature mapping to obtain the frame-level cluster assignment probability comprises:
performing frame-level clustering on the frame-level feature sequence obtained by the feature mapping through a K-class cluster classifier to obtain the K-dimensional frame-level cluster assignment probability $\mathbf{A}$, whose element $a_{n,k}$ in the n-th row and k-th column represents the predicted probability that the n-th frame is assigned to the k-th cluster, $k=1,\dots,K$, K being the number of clusters.
3. The bottom-up weakly supervised temporal action detection method according to claim 1, wherein using the frame-level feature sequence obtained by the feature mapping, combined with the frame-level cluster assignment probability and the frame-level attention weights, calculating the similarity between the cluster prototypes and the foreground/background prototypes to obtain the cluster-level classification probability comprises:
combining the frame-level feature sequence obtained by the feature mapping with the frame-level cluster assignment probability, and calculating a prototype of each cluster; taking the frame-level attention weights as the frame-level foreground probability, and calculating a foreground prototype and a background prototype in combination with the frame-level feature sequence obtained by the feature mapping;
calculating the similarity between each cluster prototype and the foreground and background prototypes respectively to obtain the cluster-level classification probability, i.e., the probability that each cluster belongs to the foreground or the background.
4. The bottom-up weakly supervised temporal action detection method according to claim 1, wherein step 2 is executed by a frame-level clustering module and step 3 is executed by a cluster-level classification module; the weakly supervised video temporal action detection model, the frame-level clustering module, and the cluster-level classification module form a network model that is trained in advance, and the total loss function during training comprises: the video-level classification loss and the attention weight constraint loss calculated using the class activation sequence and the frame-level attention weights generated by the weakly supervised video temporal action detection model, the first classification loss calculated using the frame-level cluster assignment probability and its corresponding pseudo label, and the second classification loss calculated using the cluster-level classification probability and its corresponding pseudo label.
5. The bottom-up weakly supervised temporal action detection method according to claim 4, wherein the pseudo label corresponding to the frame-level cluster assignment probability and the pseudo label corresponding to the cluster-level classification probability are generated by an optimal-transport-based pseudo-label technique.
6. The bottom-up weakly supervised temporal action detection method according to claim 5, wherein
the pseudo label corresponding to the frame-level cluster assignment probability is generated as
$$\mathbf{Y}^{*}=\operatorname{diag}(\mathbf{u})\,\mathbf{A}^{\lambda}\,\operatorname{diag}(\mathbf{v}),$$
wherein $\mathbf{A}$ represents the frame-level cluster assignment probability, $\mathbf{u}$ and $\mathbf{v}$ are two vectors ensuring that the pseudo label $\mathbf{Y}^{*}$ corresponding to the frame-level cluster assignment probability is a probability distribution, $\lambda$ is a predefined constant, and diag is a function used to construct a diagonal matrix;
the pseudo label corresponding to the cluster-level classification probability is generated as
$$\mathbf{Z}^{*}=\operatorname{diag}(\mathbf{u}')\,\mathbf{Q}^{\lambda}\,\operatorname{diag}(\mathbf{v}'),$$
wherein $\mathbf{Q}$ represents the cluster-level classification probability, and $\mathbf{u}'$ and $\mathbf{v}'$ are two vectors ensuring that the pseudo label $\mathbf{Z}^{*}$ corresponding to the cluster-level classification probability is a probability distribution.
7. The bottom-up weakly supervised temporal action detection method according to claim 1, wherein calculating the frame-level foreground probability from the frame-level cluster assignment probability and the cluster-level classification probability, and calculating the action detection result in combination with the class activation sequence and the frame-level attention weights, comprises:
calculating the frame-level foreground probability from the frame-level cluster assignment probability and the cluster-level classification probability;
performing a weighted summation of the class activation sequence, the frame-level attention weights, and the frame-level foreground probability to obtain a modulated class activation sequence;
setting a plurality of thresholds on the modulated class activation sequence to generate a plurality of action proposals;
removing redundant action proposals through non-maximum suppression, the remaining action proposals being the action detection result.
8. A bottom-up weakly supervised temporal action detection system, characterized by comprising:
a weakly supervised video temporal action detection model, used to perform feature mapping on a frame-level feature sequence of a video to be detected, wherein a video classification branch generates a class activation sequence from the frame-level feature sequence obtained by the feature mapping, and an attention branch generates frame-level attention weights from the frame-level feature sequence obtained by the feature mapping;
a frame-level clustering module, used to perform frame-level clustering on the frame-level feature sequence obtained by the feature mapping to obtain a frame-level cluster assignment probability;
a cluster-level classification module, used to calculate, from the frame-level feature sequence obtained by the feature mapping, combined with the frame-level cluster assignment probability and the frame-level attention weights, the similarity between the cluster prototypes and the foreground/background prototypes to obtain a cluster-level classification probability;
an action detection module, used to calculate a frame-level foreground probability from the frame-level cluster assignment probability and the cluster-level classification probability, and to calculate an action detection result in combination with the class activation sequence and the frame-level attention weights.
9. A processing apparatus, comprising: one or more processors; a memory for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-7.
10. A readable storage medium storing a computer program, which when executed by a processor implements the method of any one of claims 1-7.
CN202310830419.9A (filed 2023-07-07, priority 2023-07-07) - Bottom-up weakly supervised temporal action detection method, system, device, and medium - Active - granted as CN116563953B

Publications

CN116563953A - published 2023-08-08 (application publication)
CN116563953B - published 2023-10-20 (granted patent)

Citations (cited by examiner)

Patent citations:

  • CN111783713A (published 2020-10-16, 中国科学院自动化研究所) - Weakly supervised temporal behavior localization method and device based on a relation prototype network
  • CN114519837A (published 2022-05-20, 首都体育学院) - Weakly supervised temporal action localization method based on 3D-convolution spatio-temporal feature clustering
  • CN115311449A (published 2022-11-08, 复旦大学) - Weakly supervised image object localization analysis system based on class re-activation maps

Family cited patents:

  • US10013637B2 (published 2018-07-03, Microsoft Technology Licensing, LLC) - Optimizing multi-class image classification using patch features
  • GB201918893D0 (published 2020-02-05, Sita Information Networking Computing UK Ltd) - Image processing system and method
  • US11636385B2 (published 2023-04-25, International Business Machines Corporation) - Training an object detector using raw and unlabeled videos and extracted speech

Non-patent citations:

  • Baifeng Shi et al., "Weakly-Supervised Action Localization by Generative Attention Modeling," CVPR 2020.
  • Phuc Xuan Nguyen et al., "Weakly-supervised Action Localization with Background Modeling," ICCV.
  • 杨辉 et al., "Research progress of object detection based on weakly supervised learning," Computer Engineering and Applications, vol. 57, no. 16.
  • 罗会兰 et al., "Adversarial learning methods for weakly supervised semantic segmentation," Application Research of Computers.



Legal Events

  • PB01 - Publication
  • SE01 - Entry into force of request for substantive examination
  • GR01 - Patent grant