CN116030538A - Weak supervision action detection method, system, equipment and storage medium

Weak supervision action detection method, system, equipment and storage medium

Info

Publication number
CN116030538A
Authority
CN
China
Prior art keywords
action
class
category
scores
loss
Prior art date
Legal status
Granted
Application number
CN202310326090.2A
Other languages
Chinese (zh)
Other versions
CN116030538B (en)
Inventor
王子磊
李志林
Current Assignee
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN202310326090.2A priority Critical patent/CN116030538B/en
Publication of CN116030538A publication Critical patent/CN116030538A/en
Application granted granted Critical
Publication of CN116030538B publication Critical patent/CN116030538B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Abstract

The invention discloses a weakly supervised action detection method, system, device and storage medium. In the corresponding schemes: according to the action prediction results of two different branches, ambiguous video frames with inconsistent predictions are screened out, and contrast learning is used to promote the feature learning of these inconsistent video frames; a consistency constraint is applied to the two branches to reduce their difference, which solves the problem that, when the difference between the two branches is too large, enough video frames cannot be maintained for contrast learning, and at the same time the collaborative learning of the two branches further enhances the reliability of the action detection result; applying a consistency constraint within the class-independent action detection branch can reduce the influence of pseudo-label errors when training this branch. In general, the invention improves the detection performance of the action detection network and ensures the accuracy and reliability of action detection by using contrast learning and consistency constraints.

Description

Weak supervision action detection method, system, equipment and storage medium
Technical Field
The present invention relates to the technical field of video action detection, and in particular to a weakly supervised action detection method, system, apparatus and storage medium.
Background
In recent years, with the increasing capabilities of video data acquisition, transmission and storage, the demand for video analysis applications has gradually increased, and the temporal action detection task has attracted the attention of many researchers at home and abroad due to its wide practical applications, such as security monitoring, video retrieval, sports video highlight clipping and video auditing. Researchers therefore need to design suitable temporal action detection schemes according to the demands of different application scenarios and provide accurate action localization and classification results.
At present, the temporal action detection task mainly has two learning paradigms: (1) a fully supervised learning paradigm, in which complete frame-level annotations are provided; (2) a weakly supervised learning paradigm, in which no frame-level annotations are provided but the video-level class labels are known. For fully supervised temporal action detection, all sample videos need to be manually annotated frame by frame; the annotation work is time-consuming and its precision is generally low. Therefore, in order to solve the problems of difficult annotation and large annotation errors, weakly supervised temporal action detection has emerged. This paradigm can assign video-level labels to a large number of untrimmed videos on the web and use them directly as training data.
Although weakly supervised temporal action detection has many benefits, its performance is generally poor due to the lack of fine-grained annotation. Weakly supervised temporal action detection at the present stage faces two basic challenges: action completeness modeling and action-context confusion. The current mainstream approach at home and abroad is to train a classification network directly with video-level labels to obtain a class activation sequence, then select a series of candidate segments on the class activation sequence with multiple thresholds, and finally obtain the final prediction results with non-maximum suppression. The root cause of these two problems is that, without frame-level label supervision, the network model has difficulty learning the real action information, so it produces wrong or ambiguous predictions for segments strongly correlated with the scene and for segments at action boundaries. The Chinese patent application with publication number CN115272941A, "Weakly supervised video temporal action detection and classification method and system", adopts a collaborative distillation strategy so that the advantages of the single-modal and cross-modal frameworks complement each other, achieving more complete and accurate temporal action detection and classification. The PCT international application published as CN110832499A after entering the national phase, "Weak supervision action localization through sparse temporal pooling networks", performs action recognition using a sparse key-frame attention mechanism. The Chinese patent application CN115439790A, "Weakly supervised temporal action localization method based on cascaded seed region growing modules", obtains an original class activation sequence from temporal features, obtains an expanded class activation sequence with a seed growing strategy, performs adversarial erasing, and fuses the original class activation sequence with the erased class activation sequence to obtain a more reliable class activation sequence and improve detection accuracy. The Chinese patent application CN114898259A, "Weakly supervised video temporal action localization method based on action-associated attention", uses an action-associated attention model to establish relationships between action segments in a video, uses a query mechanism to establish weakly supervised pre-training, and feeds the output of the query mechanism into the decoder of a Transformer architecture to achieve temporal localization of the query set; an encoder of the Transformer architecture is used to determine the relationships between video segment features, thereby achieving localization and classification of action segments. None of the above methods takes into account the influence of ambiguous video frames in the video on the detection performance of the network model, so the detection performance still needs to be improved.
In view of this, the present invention has been made.
Disclosure of Invention
The aim of the invention is to provide a weakly supervised action detection method, system, device and storage medium, which can greatly improve the accuracy and reliability of action detection.
This aim of the invention is achieved by the following technical solutions:
a method of weakly supervised action detection, comprising:
constructing an action detection network comprising a class-aware action detection branch and a class-independent action detection branch, and training the action detection network; during training, inputting optical flow features and RGB features of a training video, obtaining a class activation sequence through the class-aware action detection branch and generating class-aware action scores, obtaining optical flow feature action scores and RGB feature action scores through the class-independent action detection branch and fusing them into class-independent action scores; dividing the video frames in the training video into types according to the class-aware action scores and the class-independent action scores, constructing corresponding positive and negative samples according to the type of each frame, and calculating contrast losses using the embedded features respectively generated in the class-aware action detection branch and the class-independent action detection branch, so as to obtain a total contrast loss; calculating a class-aware loss according to the class activation sequence, and calculating a class-independent loss according to the optical flow feature action scores and the RGB feature action scores; applying consistency constraints within the class-independent action detection branch and between the class-aware action detection branch and the class-independent action detection branch respectively, and calculating the corresponding consistency losses; constructing a target loss function by combining the total contrast loss, the class-aware loss, the class-independent loss and the consistency losses, and training the action detection network using the target loss function; wherein RGB represents the three color channels of red, green and blue;
inputting the optical flow features and RGB features of a video to be detected into the trained action detection network, and generating an action detection result by combining the class activation sequence of the class-aware action detection branch with the optical flow feature action scores and RGB feature action scores of the class-independent action detection branch.
A weak supervision action detection system comprising:
an action detection network construction and training unit, configured to construct an action detection network comprising a class-aware action detection branch and a class-independent action detection branch, and to train the action detection network; during training, inputting optical flow features and RGB features of a training video, obtaining a class activation sequence through the class-aware action detection branch and generating class-aware action scores, obtaining optical flow feature action scores and RGB feature action scores through the class-independent action detection branch and fusing them into class-independent action scores; dividing the video frames in the training video into types according to the class-aware action scores and the class-independent action scores, constructing corresponding positive and negative samples according to the type of each frame, and calculating contrast losses using the embedded features respectively generated in the class-aware action detection branch and the class-independent action detection branch, so as to obtain a total contrast loss; calculating a class-aware loss according to the class activation sequence, and calculating a class-independent loss according to the optical flow feature action scores and the RGB feature action scores; applying consistency constraints within the class-independent action detection branch and between the class-aware action detection branch and the class-independent action detection branch respectively, and calculating the corresponding consistency losses; constructing a target loss function by combining the total contrast loss, the class-aware loss, the class-independent loss and the consistency losses, and training the action detection network using the target loss function; wherein RGB represents the three color channels of red, green and blue;
an action detection unit, configured to input the optical flow features and RGB features of a video to be detected into the trained action detection network, and to generate an action detection result by combining the class activation sequence of the class-aware action detection branch with the optical flow feature action scores and RGB feature action scores of the class-independent action detection branch.
A processing apparatus, comprising: one or more processors; a memory for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the aforementioned methods.
A readable storage medium storing a computer program which, when executed by a processor, implements the method described above.
According to the technical solution provided by the invention, ambiguous video frames with inconsistent prediction results (video frames that affect action localization) are screened out according to the action prediction results of the two different branches, and contrast learning is used to promote the feature learning of these inconsistent video frames; the difference between the two branches is reduced by applying a consistency constraint to them, which solves the problem that, when the difference between the two branches is too large, enough video frames cannot be maintained for contrast learning, and at the same time the collaborative learning of the two branches (namely the action consistency loss introduced later) further enhances the reliability of the action detection result; applying a consistency constraint within the class-independent action detection branch can reduce the influence of pseudo-label errors when training this branch. In general, the invention improves the detection performance of the action detection network and ensures the accuracy and reliability of action detection by using contrast learning and consistency constraints.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a weak supervision action detection method according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of an action detection network according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a weak supervision action detection system according to an embodiment of the present invention;
fig. 4 is a schematic diagram of a processing apparatus according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to fall within the scope of the invention.
The terms that may be used herein will first be described as follows:
the terms "comprises," "comprising," "includes," "including," "has," "having" or other similar referents are to be construed to cover a non-exclusive inclusion. For example: including a particular feature (e.g., a starting material, component, ingredient, carrier, formulation, material, dimension, part, means, mechanism, apparatus, step, procedure, method, reaction condition, processing condition, parameter, algorithm, signal, data, product or article of manufacture, etc.), should be construed as including not only a particular feature but also other features known in the art that are not explicitly recited.
The following describes in detail a method, a system, a device and a storage medium for detecting weak supervision actions. What is not described in detail in the embodiments of the present invention belongs to the prior art known to those skilled in the art. The specific conditions are not noted in the examples of the present invention and are carried out according to the conditions conventional in the art or suggested by the manufacturer.
Example 1
The embodiment of the invention provides a weakly supervised action detection method, which designs an action detection network with a dual-branch structure, screens out ambiguous sample frames with inconsistent prediction results according to the action prediction results of the two different branches, and uses contrast learning to promote the feature learning of these inconsistent sample frames with the help of the more reliable consistent sample frames. At the same time, dual-branch collaborative learning and a modality consistency constraint are used to further enhance the reliability of the prediction results. Fig. 1 shows a flowchart of the method provided by the invention, which mainly includes the following steps:
And step 1, constructing an action detection network comprising a category-aware action detection branch and a category-independent action detection branch, and training the action detection network.
In the embodiment of the invention, inspired by current temporal action detection techniques, a class-aware action detection branch and a class-independent action detection branch are constructed to form the action detection network.
The training mode of the action detection network is as follows: inputting optical flow features and RGB features of a training video, obtaining a class activation sequence through the class-aware action detection branch and generating class-aware action scores, obtaining optical flow feature action scores and RGB feature action scores through the class-independent action detection branch and fusing them into class-independent action scores; dividing the video frames in the training video into types according to the class-aware action scores and the class-independent action scores, constructing corresponding positive and negative samples according to the type of each frame, and calculating contrast losses using the embedded features respectively generated in the class-aware action detection branch and the class-independent action detection branch, so as to obtain a total contrast loss; calculating a class-aware loss according to the class activation sequence, and calculating a class-independent loss according to the optical flow feature action scores and the RGB feature action scores; applying consistency constraints within the class-independent action detection branch and between the class-aware action detection branch and the class-independent action detection branch respectively, and calculating the corresponding consistency losses; and constructing a target loss function by combining the total contrast loss, the class-aware loss, the class-independent loss and the consistency losses, and training the action detection network using the target loss function.
In the embodiment of the present invention, "RGB features" is a term of art referring to features extracted directly from the video; for example, in the examples provided later, RGB features are extracted using a pre-trained I3D model (a behavior recognition model). RGB represents the three color channels of red, green and blue.
And step 2, inputting the optical flow features and RGB features of the video to be detected into the trained action detection network, and generating an action detection result by combining the class activation sequence of the class-aware action detection branch with the optical flow feature action scores and RGB feature action scores of the class-independent action detection branch.
The method provided by the embodiment of the invention solves various problems existing in the prior art, and is a weakly supervised action detection method guided by action inconsistency. Because current weakly supervised action detection methods lack frame-level label supervision, the model cannot make a very definite judgment on each frame, and many ambiguous frames that affect action localization exist. Therefore, the method constructs a class-aware action detection branch and a class-independent action detection branch, divides the frame types by comparing the predictions of the two branches, finds the video frames with inconsistent predictions, and divides all frames in the video into consistent action/background frames and inconsistent action/background frames as the positive and negative samples for contrast learning, thereby using contrast learning to enhance the feature learning of video frames and improve the detection performance.
In order to more clearly demonstrate the technical scheme and the technical effects provided by the invention, the method provided by the embodiment of the invention is described in detail below by using specific embodiments.
1. Overall network architecture.
As shown in fig. 2, the action detection network provided in the embodiment of the present invention mainly includes: a class-aware action detection branch and a class-independent action detection branch.
1. Class-aware action detection branch.
As shown in fig. 2, the class-aware action detection branch mainly includes a feature embedding layer f and a classification layer cls. The optical flow features and the RGB features are concatenated and input into the class-aware action detection branch; the comprehensive embedded features E are obtained through the feature embedding layer f, the class activation sequence CAS is obtained through the classification layer cls, the class activation sequence CAS is summed over the class dimension, and an activation function is applied to obtain the class-aware action scores A. The above flow in the class-aware action detection branch may be expressed as:

E = f(F1)

CAS = cls(E)

A = sigmoid(sum(CAS))

wherein F1 represents the feature obtained by concatenating the optical flow features and the RGB features, and sigmoid is the activation function.

Through the class-aware action detection branch, the class-aware action scores of all video frames in the training video can be obtained.
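For illustration only, a minimal PyTorch sketch of this forward flow is given below; the use of 1D temporal convolutions and the specific layer sizes (feat_dim, embed_dim, num_classes) are assumptions of the sketch, not details specified by the patent.

import torch
import torch.nn as nn

class ClassAwareBranch(nn.Module):
    # Sketch of the class-aware action detection branch: feature embedding layer f + classification layer cls.
    def __init__(self, feat_dim=2048, embed_dim=2048, num_classes=20):
        super().__init__()
        # feature embedding layer f: a 1D temporal convolution over the frame sequence (assumed design)
        self.f = nn.Sequential(nn.Conv1d(feat_dim, embed_dim, kernel_size=3, padding=1), nn.ReLU())
        # classification layer cls: one score per class for every frame
        self.cls = nn.Conv1d(embed_dim, num_classes, kernel_size=1)

    def forward(self, flow_feat, rgb_feat):
        # flow_feat, rgb_feat: (B, T, D) frame-level features; concatenation gives feat_dim = 2 * D channels
        F1 = torch.cat([flow_feat, rgb_feat], dim=-1).transpose(1, 2)  # (B, 2D, T)
        E = self.f(F1)                         # comprehensive embedded features E, shape (B, embed_dim, T)
        CAS = self.cls(E)                      # class activation sequence, shape (B, num_classes, T)
        A = torch.sigmoid(CAS.sum(dim=1))      # class-aware action score per frame, shape (B, T)
        return E, CAS, A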
2. Class-independent action detection branch.
In the embodiment of the invention, the class-independent action detection branch does not distinguish action classes and only distinguishes actions from background; it processes the optical flow features and the RGB features separately, the two features use the same processing method, and finally the action scores of the two features are fused. Specifically, the class-independent action detection branch comprises two sub-branches, and the optical flow features and RGB features are input into one sub-branch each; after the optical flow features and the RGB features are input into the corresponding sub-branches, the corresponding embedded features are obtained through the embedding layer in each sub-branch, and the action scores corresponding to each feature are then obtained through the classification layer and the activation function.
As shown in fig. 2, the first sub-branch (the upper sub-branch in the class-independent action detection branch) includes an RGB embedding layer f_R and an RGB classification layer cls_R: the input RGB features F_R pass through the RGB embedding layer f_R to obtain the RGB embedded features E_R, which are then processed by the RGB classification layer cls_R and the activation function to obtain the RGB feature action scores A_R. The second sub-branch (the lower sub-branch in the class-independent action detection branch) includes an optical flow embedding layer f_F and an optical flow classification layer cls_F: the input optical flow features F_F pass through the optical flow embedding layer f_F to obtain the optical flow embedded features E_F, which are then processed by the optical flow classification layer cls_F and the activation function to obtain the optical flow feature action scores A_F. The class-independent action scores A_ca are obtained after fusing the two types of feature action scores. The above flow in the class-independent action detection branch can be expressed as:

E_* = f_*(F_*)

A_* = sigmoid(cls_*(E_*))

A_ca = (A_F + A_R) / 2

where * = R corresponds to the RGB features and * = F corresponds to the optical flow features. When * = R, the first two formulas represent the processing flow of the first sub-branch; when * = F, they represent the processing flow of the second sub-branch. The third formula represents fusing the two types of feature action scores by averaging.

In the embodiment of the invention, the RGB features and the optical flow features are processed separately in the class-independent action detection branch; each video frame has an RGB action score and an optical flow action score, and the class-independent action score of each video frame can then be obtained by fusion.
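For illustration only, a companion PyTorch sketch of one class-independent sub-branch and the score fusion is given below; the layer types and sizes are again assumptions of the sketch.

import torch
import torch.nn as nn

class ClassAgnosticSubBranch(nn.Module):
    # One sub-branch (RGB or optical flow): embedding layer f_* + classification layer cls_*.
    def __init__(self, feat_dim=1024, embed_dim=1024):
        super().__init__()
        self.embed = nn.Sequential(nn.Conv1d(feat_dim, embed_dim, kernel_size=3, padding=1), nn.ReLU())
        self.cls = nn.Conv1d(embed_dim, 1, kernel_size=1)   # a single "actionness" score, no class distinction

    def forward(self, feat):                                # feat: (B, T, D)
        E = self.embed(feat.transpose(1, 2))                # embedded features E_*, shape (B, embed_dim, T)
        A = torch.sigmoid(self.cls(E)).squeeze(1)           # per-frame action score A_*, shape (B, T)
        return E, A

def class_agnostic_forward(rgb_branch, flow_branch, rgb_feat, flow_feat):
    E_R, A_R = rgb_branch(rgb_feat)     # first sub-branch: RGB features
    E_F, A_F = flow_branch(flow_feat)   # second sub-branch: optical flow features
    A_ca = (A_F + A_R) / 2              # fused class-independent action score
    return E_R, E_F, A_R, A_F, A_ca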
2. Objective loss function for network training.
1. Class-aware loss.
In the embodiment of the invention, a multi-instance learning paradigm and a cross entropy loss function are used on the class-aware action detection branch. The scores of the K1 highest-scoring video frames of each class in the class activation sequence CAS (namely, K1 video frames are selected for each class) are aggregated to represent the score of each class for the training video, where K1 is a preset positive integer. The cross entropy loss is then calculated with the given video-level label as the class-aware loss, expressed as:

L_ca = - Σ_{c=1..C} y_c · log(p_c)

wherein L_ca represents the class-aware loss, C is the total number of classes, p_c is the score of the c-th class for the training video, and y_c is the true label of the c-th class of the training video, which belongs to the given video-level label.
In the embodiment of the invention, each training video has a label corresponding to the action category appearing in the video, namely a video-level label.
In the embodiment of the present invention, the value of K1 varies for different data sets (the higher the proportion of action frames in the videos of the data set, the smaller K1 should be); generally speaking, it may be set as K1 = L/8 to L/2, where L is the total number of frames in the training video.
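For illustration only, a minimal PyTorch sketch of the top-K1 multi-instance aggregation and the cross entropy loss is given below; the softmax normalization over classes and the normalization of the multi-hot label are assumptions made to obtain a runnable example.

import torch
import torch.nn.functional as F

def class_aware_loss(cas, video_label, k1):
    # cas: (B, C, T) class activation sequence; video_label: (B, C) multi-hot video-level label
    topk_scores, _ = cas.topk(k1, dim=2)             # K1 highest frame scores per class
    video_scores = topk_scores.mean(dim=2)           # aggregate to a video-level score per class, shape (B, C)
    p = F.softmax(video_scores, dim=1)               # video-level class probabilities (assumed normalization)
    y = video_label / video_label.sum(dim=1, keepdim=True).clamp(min=1e-8)  # normalized multi-hot label
    return -(y * torch.log(p + 1e-8)).sum(dim=1).mean()  # cross entropy with the video-level label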
2. Class-independent loss.
In the embodiment of the invention, an online pseudo-label generation strategy is designed, as shown in fig. 2. In the class-aware action detection branch, the class activation sequence CAS is normalized over the class dimension by a softmax function (normalized exponential function). The RGB feature action scores and the optical flow feature action scores of all video frames output by the class-independent action detection branch respectively form corresponding action sequences, called the first action sequence and the second action sequence. The normalized class activation sequence CAS, the first action sequence and the second action sequence are fused (for example, the three are added and averaged) to obtain a comprehensive class activation sequence. Then, according to the action classes in the given video-level label, the K1 video frames with the highest scores are found in the comprehensive class activation sequence and marked as the action set T_a, and the remaining video frames are marked as the background set T_b. The two resulting sets are the frame-level pseudo labels.
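For illustration only, a minimal PyTorch sketch of this pseudo-label generation is given below; taking, for each frame, the maximum fused score over the classes present in the video-level label before the top-K1 selection is an assumption of the sketch.

import torch
import torch.nn.functional as F

def generate_frame_pseudo_labels(cas, a_rgb, a_flow, video_label, k1):
    # cas: (B, C, T); a_rgb, a_flow: (B, T) action scores from the class-independent branch
    # video_label: (B, C) multi-hot video-level label; returns boolean action / background masks, shape (B, T)
    cas_norm = F.softmax(cas, dim=1)                                      # normalize the CAS over the class dimension
    fused = (cas_norm + a_rgb.unsqueeze(1) + a_flow.unsqueeze(1)) / 3.0   # comprehensive class activation sequence
    fused = fused * video_label.unsqueeze(-1)                             # keep only classes present in the video label
    frame_score, _ = fused.max(dim=1)                                     # (B, T) best score among labelled classes
    topk_idx = frame_score.topk(k1, dim=1).indices                        # K1 highest-scoring frames -> action set
    action_mask = torch.zeros_like(frame_score, dtype=torch.bool)
    action_mask.scatter_(1, topk_idx, torch.ones_like(topk_idx, dtype=torch.bool))
    return action_mask, ~action_mask                                      # action set and background set pseudo labels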
Then, the corresponding generalized cross entropy losses are calculated using the optical flow feature action scores and the RGB feature action scores respectively, expressed as:

L_GCE^* = (1/|T_a|) · Σ_{t ∈ T_a} (1 − (A_t^*)^q) / q + (1/|T_b|) · Σ_{t' ∈ T_b} (1 − (1 − A_{t'}^*)^q) / q

where q is the noise tolerance coefficient, 0 < q ≤ 1, and the larger q is, the higher the noise tolerance; L_GCE^* represents the generalized cross entropy loss calculated using the RGB feature action scores (* = R) or the optical flow feature action scores (* = F); A_t^* represents the RGB feature action score or optical flow feature action score of the t-th video frame in the action set T_a; A_{t'}^* represents the RGB feature action score or optical flow feature action score of the t'-th video frame in the background set T_b; and |T_a| and |T_b| represent the numbers of video frames in the action set T_a and the background set T_b respectively.

The generalized cross entropy loss calculated from the RGB feature action scores and the generalized cross entropy loss calculated from the optical flow feature action scores are added to obtain the class-independent loss, expressed as:

L_cai = L_GCE^R + L_GCE^F

wherein L_cai is the class-independent loss, L_GCE^R is the generalized cross entropy loss calculated using the RGB feature action scores, and L_GCE^F is the generalized cross entropy loss calculated using the optical flow feature action scores.
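For illustration only, a minimal PyTorch sketch of the generalized cross entropy loss and the class-independent loss is given below; the default value q=0.7 is an assumption, not a value taken from the patent.

import torch

def generalized_ce_loss(action_score, action_mask, background_mask, q=0.7):
    # action_score: (B, T) per-frame action scores of one modality (RGB or optical flow)
    # action_mask / background_mask: boolean frame-level pseudo labels; q: noise tolerance, 0 < q <= 1
    a = action_score[action_mask]                                           # frames pseudo-labelled as action
    b = action_score[background_mask]                                       # frames pseudo-labelled as background
    loss_action = ((1.0 - a.clamp(min=1e-8) ** q) / q).mean()               # push action-frame scores towards 1
    loss_background = ((1.0 - (1.0 - b).clamp(min=1e-8) ** q) / q).mean()   # push background-frame scores towards 0
    return loss_action + loss_background

def class_independent_loss(a_rgb, a_flow, action_mask, background_mask, q=0.7):
    # class-independent loss: sum of the generalized cross entropy losses of the two modalities
    return (generalized_ce_loss(a_rgb, action_mask, background_mask, q)
            + generalized_ce_loss(a_flow, action_mask, background_mask, q))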
3. Consistency loss.
In the embodiment of the invention, consistency constraints are respectively applied between the category-aware action detection branches and the category-independent action detection branches and inside the category-independent action detection branches, and corresponding consistency losses are calculated.
(1) A consistency constraint applied within the class-independent action detection branch.
In the embodiment of the invention, a multi-modal consistency constraint is applied within the class-independent action detection branch. Specifically, in order to promote mutual learning of the bimodal features (RGB features and optical flow features), the embodiment of the invention designs a multi-modal consistency constraint, in which the action scores of the RGB features and the action scores of the optical flow features are learned as soft labels of each other, and the multi-modal consistency loss L_mmc is calculated, expressed as:

L_mmc = MSE(A_R, sg(A_F)) + MSE(A_F, sg(A_R))

wherein A_R represents the RGB feature action scores, A_F represents the optical flow feature action scores, MSE(·,·) represents the mean square error loss function, and sg(·) represents stopping the gradient back-propagation.
(2) A consistency constraint is imposed between the category-aware action detection branch and the category-independent action detection branch.
In the embodiment of the invention, an action consistency constraint is applied between the class-aware action detection branch and the class-independent action detection branch. Specifically, in order to avoid the prediction difference between the class-aware action detection branch and the class-independent action detection branch becoming so large in the later training stage that not enough consistent samples can be generated, the embodiment of the invention designs an action consistency constraint to promote mutual learning of the two branches while keeping a normal degree of difference. The action consistency loss calculated in this part is expressed as:

L_ac = MSE(A, A_ca)

wherein L_ac represents the action consistency loss, A_ca represents the class-independent action scores, and A represents the class-aware action scores.
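Both constraints reduce to mean square error terms. For illustration only, a minimal PyTorch sketch is given below, where detach() is used to realize the stop-gradient operation sg(·); this implementation choice is an assumption of the sketch.

import torch.nn.functional as F

def multimodal_consistency_loss(a_rgb, a_flow):
    # each modality's action score is used as a soft label for the other; detach() realizes the stop-gradient sg(.)
    return F.mse_loss(a_rgb, a_flow.detach()) + F.mse_loss(a_flow, a_rgb.detach())

def action_consistency_loss(a_class_aware, a_class_independent):
    # mean square error between the class-aware action score A and the fused class-independent action score A_ca
    return F.mse_loss(a_class_aware, a_class_independent)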
4. Total contrast loss.
In the embodiment of the invention, for each branch, the median of all prediction scores in that branch is taken; video frames above the median are preliminarily marked as action, and video frames below the median are preliminarily marked as background, giving the preliminary action/background prediction results of the two branches. For the class-aware action detection branch, the prediction score is the class-aware action score, and for the class-independent action detection branch, the prediction score is the class-independent action score. The preliminary action/background prediction results of the two branches are then compared: video frames preliminarily marked as action by both branches are divided into consistent action frames (CA); video frames preliminarily marked as background by both branches are divided into consistent background frames (CB); video frames whose preliminary marks differ between the two branches (namely the ambiguous video frames) are sorted according to the class-aware action scores, the K2 highest-scoring frames are divided into inconsistent action frames (IA), and the K2 lowest-scoring frames are divided into inconsistent background frames (IB); wherein K2 is a preset positive integer.
In the embodiment of the invention, it may be set as K2 = L/20, where L is the total number of frames in the training video.
Four different positive and negative sample pairs are constructed for the four different frame types: for consistent action frames, the positive samples are inconsistent action frames and the negative samples are consistent background frames; for consistent background frames, the positive samples are inconsistent background frames and the negative samples are consistent action frames; for inconsistent action frames, the positive samples are consistent action frames and the negative samples are consistent background frames; for inconsistent background frames, the positive samples are consistent background frames and the negative samples are consistent action frames. The right side of fig. 2 intuitively shows the correspondence between the above four frame types and their positive and negative samples, where the sign "+" indicates positive samples and the sign "−" indicates negative samples; fig. 2 also marks the preliminary action/background prediction results of the class-aware action detection branch and of the class-independent action detection branch.
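For illustration only, a minimal PyTorch sketch of the frame-type division for a single video is given below; corner cases (for example, fewer than 2·K2 inconsistent frames) are ignored for brevity.

import torch

def divide_frame_types(a_class_aware, a_class_independent, k2):
    # a_class_aware, a_class_independent: (T,) prediction scores of the two branches for one video
    pred_aware = a_class_aware > a_class_aware.median()              # preliminary action/background mark, branch 1
    pred_indep = a_class_independent > a_class_independent.median()  # preliminary action/background mark, branch 2
    ca = (pred_aware & pred_indep).nonzero().squeeze(1)              # consistent action frames (CA)
    cb = (~pred_aware & ~pred_indep).nonzero().squeeze(1)            # consistent background frames (CB)
    idx = (pred_aware ^ pred_indep).nonzero().squeeze(1)             # frames the two branches disagree on
    order = a_class_aware[idx].argsort(descending=True)              # rank the ambiguous frames by class-aware score
    ia = idx[order[:k2]]                                             # K2 highest-scoring: inconsistent action frames (IA)
    ib = idx[order[-k2:]]                                            # K2 lowest-scoring: inconsistent background frames (IB)
    return ca, cb, ia, ib

# positive / negative pairing used for contrast learning:
#   CA -> positives IA, negatives CB;   CB -> positives IB, negatives CA
#   IA -> positives CA, negatives CB;   IB -> positives CB, negatives CA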
In the embodiment of the invention, three types of embedded features are generated by the two branches, namely: the comprehensive embedded features in the class-aware action detection branch, and the RGB embedded features and optical flow embedded features in the class-independent action detection branch. The three types of embedded features are used to calculate the corresponding contrast losses respectively; the calculation principle is the same and the difference lies mainly in the type of embedded features used, and finally the total contrast loss is synthesized from them.
In the embodiment of the invention, each frame type is used in turn as the anchor sample and combined with its corresponding positive and negative samples to calculate a contrast loss; InfoNCE is used as the contrast loss function, which reduces the distance between the anchor sample and the positive samples and enlarges the distance between the anchor sample and the negative samples, expressed as:

L_X = - (1/|X|) · Σ_{x ∈ X} log[ exp(x^T x^+ / τ) / ( exp(x^T x^+ / τ) + Σ_{i ∈ M} exp(x^T x_i^- / τ) ) ]

wherein x represents the embedded feature of an anchor sample, T is the transpose symbol, and X represents the set of embedded features of all video frames of the current frame type, each of which is taken in turn as the anchor sample; L_X represents the contrast loss; x^+ represents the mean of the embedded features of all positive samples; x_i^- represents the embedded feature of the i-th negative sample, and M is the set of embedded features of the negative samples; τ is the temperature parameter used to control the penalty strength of contrast learning. The embedded features include: the comprehensive embedded features in the class-aware action detection branch, and the RGB embedded features and optical flow embedded features in the class-independent action detection branch.
As will be appreciated by those skilled in the art, InfoNCE is the name of a loss function from the contrastive learning literature and is a term of art, where NCE stands for Noise-Contrastive Estimation.
When the embedded features are the comprehensive embedded features, the four different frame types are each used as anchor samples, the corresponding contrast losses are calculated, and they are combined as the first contrast loss, expressed as:

L_1 = L_IA + L_IB + L_CA + L_CB

wherein L_1 represents the first contrast loss; L_IA represents the contrast loss calculated using the set of inconsistent action frames as anchor samples; L_IB represents the contrast loss calculated using the set of inconsistent background frames as anchor samples; L_CA represents the contrast loss calculated using the set of consistent action frames as anchor samples; and L_CB represents the contrast loss calculated using the set of consistent background frames as anchor samples.
When the embedded features are the RGB embedded features, the four different frame types are each used as anchor samples, the corresponding contrast losses are calculated and combined as the second contrast loss L_2; when the embedded features are the optical flow embedded features, the four different frame types are each used as anchor samples, the corresponding contrast losses are calculated and combined as the third contrast loss L_3. Finally, the total contrast loss is the sum of the above first, second and third contrast losses, expressed as:

L_con = L_1 + L_2 + L_3

wherein L_con represents the total contrast loss.
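For illustration only, a minimal PyTorch sketch of the InfoNCE computation and of its combination over the four frame types for one embedding type is given below; the temperature value 0.1 and the absence of feature normalization are assumptions of the sketch.

import torch

def info_nce(anchor, positives, negatives, tau=0.1):
    # anchor: (N, C) embedded features of the anchor frames; positives: (P, C); negatives: (M, C)
    x_pos = positives.mean(dim=0, keepdim=True)             # mean of the positive embedded features, shape (1, C)
    pos_logit = anchor @ x_pos.t() / tau                    # (N, 1)
    neg_logit = anchor @ negatives.t() / tau                # (N, M)
    logits = torch.cat([pos_logit, neg_logit], dim=1)       # (N, 1 + M)
    return -torch.log_softmax(logits, dim=1)[:, 0].mean()   # -log( exp(pos) / (exp(pos) + sum exp(neg)) )

def contrast_loss_for_embedding(embed, ca, cb, ia, ib, tau=0.1):
    # embed: (T, C) one type of embedded features (comprehensive, RGB or optical flow), frames first
    # ca/cb/ia/ib: frame indices of the four frame types returned by divide_frame_types
    return (info_nce(embed[ca], embed[ia], embed[cb], tau)    # consistent action frames as anchors
            + info_nce(embed[cb], embed[ib], embed[ca], tau)  # consistent background frames as anchors
            + info_nce(embed[ia], embed[ca], embed[cb], tau)  # inconsistent action frames as anchors
            + info_nce(embed[ib], embed[cb], embed[ca], tau)) # inconsistent background frames as anchors

# total contrast loss: sum over the three embedding types
# total = sum(contrast_loss_for_embedding(e, ca, cb, ia, ib) for e in (E_comprehensive, E_rgb, E_flow))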
The target loss function used for training is constructed from all the above losses, namely the class-aware loss L_ca, the class-independent loss L_cai, the total contrast loss L_con, the action consistency loss L_ac and the multi-modal consistency loss L_mmc. The resulting objective is denoted L_AIGL, where AIGL denotes the action detection network constructed by the invention; three hyperparameters are used to control the weights of the related loss terms.

Based on the target loss function, the action detection network is trained until convergence, and the parameters of the action detection network are continuously optimized during training; the related flow can be implemented with reference to conventional techniques and is not repeated here.
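For illustration only, a possible way to assemble the target loss is sketched below; attaching the three hyperparameter weights to the contrast and consistency terms, and their default values, are assumptions of the sketch rather than the patent's specification.

def target_loss(l_ca, l_cai, l_con, l_ac, l_mmc, lambda1=1.0, lambda2=1.0, lambda3=1.0):
    # combines the five losses described above; which terms carry the three hyperparameter weights,
    # and the default weight values, are assumptions made for illustration only
    return l_ca + l_cai + lambda1 * l_con + lambda2 * l_ac + lambda3 * l_mmc

# typical training step (network and optimizer objects assumed to exist):
#   loss = target_loss(l_ca, l_cai, l_con, l_ac, l_mmc)
#   optimizer.zero_grad(); loss.backward(); optimizer.step()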
3. Action detection.
In the embodiment of the invention, action detection is performed on the video to be detected using the trained action detection network. In the same manner as in step 1, the optical flow features and RGB features of the video to be detected are input into the trained action detection network; a class activation sequence is obtained through the class-aware action detection branch, and the optical flow feature action scores and RGB feature action scores of all video frames are obtained through the class-independent action detection branch. The class activation sequence is fused with the optical flow feature action scores and RGB feature action scores of all video frames to obtain a comprehensive class activation sequence, and action category prediction and action localization prediction are performed using the comprehensive class activation sequence to obtain the action detection result.
The method provided by the embodiment of the invention mainly has the following advantages:
(1) A contrast learning method based on dual-branch action inconsistency is designed: consistent video frames and inconsistent video frames are found by comparing the prediction results of the class-aware action detection branch and the class-independent action detection branch, and the positive and negative samples for contrast learning are defined accordingly, so that contrast learning is used to promote the feature learning of video frames.
(2) An action consistency constraint is proposed to reduce the difference between the two branches, which solves the problem that, when the difference between the two branches is too large, enough consistent action and background frames cannot be maintained for contrast learning; at the same time, the collaborative learning of the two branches further enhances the reliability of the prediction scores.
(3) The modal consistency constraint is designed to reduce the impact of pseudo tag errors when training class independent action detection branches.
Overall, the performance of the motion detection network can be improved through the advantages, and the accuracy and the reliability of motion detection are ensured.
In order to intuitively understand the implementation process of the above method of the present invention, the following provides relevant implementation flow examples.
Step S1, preparing a video data set for training and a video data set for testing. For the training videos, video-level action category annotation needs to be performed on each video, that is, annotating which actions exist in the video, to generate video-level labels. The training video data are then sampled once every several frames (e.g., every 16 frames), the downsampled video data are input into an I3D model (a behavior recognition model) pre-trained on the Kinetics dataset (a large dataset for video action recognition), and the RGB features and optical flow features of the video data are extracted.
And S2, constructing the class-aware action detection branch and the class-independent action detection branch using fully convolutional networks based on the PyTorch (open-source machine learning library) deep learning framework, to obtain the class-aware action score A and the class-independent action score of each video frame.
And step S3, performing mean square error loss on the action score of the class-aware action detection branch and the class-independent action score of the class-independent action detection branch, and calculating action consistency loss.
And S4, for the class-aware action detection branch, according to the class activation sequence, finding the K frames with the highest score for each class, generating video-level action scores, and calculating the class-aware loss with the given video-level label.
And S5, regarding the class-independent action detection branches, comprising RGB sub-branches and optical flow sub-branches, performing mean square error loss on action scores of the two sub-branches, and calculating multi-mode consistency loss.
And S6, performing preliminary action/background marking on the action scores of the two branches according to the median: frames above the median are preliminarily marked as action, and frames below the median are marked as background. Frames marked as action in both branches are marked as consistent action frames; frames marked as background in both branches are marked as consistent background frames; frames predicted differently by the two branches are sorted according to their action scores, the K3 frames with the highest scores are taken as inconsistent action frames, and the K3 frames with the lowest scores are taken as inconsistent background frames.
And S7, constructing positive and negative samples from the four different types of marked frames, calculating the contrast losses respectively using the comprehensive embedded features in the class-aware action detection branch and the RGB embedded features and optical flow embedded features in the class-independent action detection branch, and adding them to obtain the total contrast loss.
And S8, constructing a target loss function by using the calculated loss in the step, minimizing the target loss function through a back propagation algorithm and a gradient descent strategy, and updating parameters of the action detection network.
And S9, inputting a test data set, performing action category prediction and action positioning prediction, and performing performance evaluation by combining the two types of prediction results.
(1) Action category prediction.
The class activation sequence of the class-aware action detection branch is added to the RGB feature action scores and optical flow feature action scores of the class-independent action detection branch to generate a comprehensive class activation sequence. Using the comprehensive class activation sequence, the K1 video frames with the highest score for each class are aggregated to obtain the video-level action classification probabilities (specifically, the scores of the K1 highest-scoring video frames of each class may be averaged, and the average is taken as the score of the video for that class), and the action category prediction result is generated by a thresholding method.
The RGB feature action score and the optical flow feature action score are both information in a video frame, and for the whole video, the RGB feature action score and the optical flow feature action score are both a sequence, so that the RGB feature action score and the optical flow feature action score corresponding to the whole video can be called as an action sequence and can be directly fused with a class activation sequence, for example, corresponding elements in the three sequences are added and averaged. When the action category prediction result is generated by using the thresholding method, a threshold (for example, 0.1 to 0.25) may be set according to the actual situation, and the category exceeding the threshold is marked as the predicted category.
(2) Action localization prediction.
Using the action category prediction result, the sequence of each predicted class is taken from the comprehensive class activation sequence, and multiple thresholds are set (for example, 0.1 to 0.9 at intervals of 0.1). The sequences are screened with these thresholds in turn: consecutive frames above a threshold are regarded as a predicted action segment, and the confidence of the predicted segment is calculated, yielding a large number of predicted segments. The predicted segments are then screened by non-maximum suppression to remove segments with excessive overlap, and the remaining predicted segments are the action localization prediction result.
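For illustration only, a simplified sketch of this thresholding and non-maximum suppression procedure is given below; using the mean score of a segment as its confidence and the IoU threshold of 0.5 are assumptions of the sketch.

import numpy as np

def temporal_iou(p, q):
    # p, q: (class, start, end, confidence) proposals; IoU of their temporal extents
    inter = max(0, min(p[2], q[2]) - max(p[1], q[1]))
    union = max(p[2], q[2]) - min(p[1], q[1])
    return inter / union if union > 0 else 0.0

def nms(proposals, iou_thr=0.5):
    # keep the highest-confidence proposal and drop same-class proposals that overlap it too much
    proposals = sorted(proposals, key=lambda p: p[3], reverse=True)
    kept = []
    for p in proposals:
        if all(p[0] != k[0] or temporal_iou(p, k) < iou_thr for k in kept):
            kept.append(p)
    return kept

def localize_actions(class_seq, pred_classes, thresholds=np.arange(0.1, 1.0, 0.1), iou_thr=0.5):
    # class_seq: (C, T) comprehensive class activation sequence; pred_classes: predicted class indices
    proposals = []
    for c in pred_classes:
        seq = class_seq[c]
        for th in thresholds:
            above = seq > th
            t = 0
            while t < len(seq):                  # group consecutive frames above the threshold into one proposal
                if above[t]:
                    s = t
                    while t < len(seq) and above[t]:
                        t += 1
                    proposals.append((c, s, t, float(seq[s:t].mean())))  # mean score as the proposal confidence
                else:
                    t += 1
    return nms(proposals, iou_thr)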
(3) Performance evaluation.
Finally, the detection performance of the motion detection network is evaluated according to the motion category prediction result and the motion positioning prediction result (motion detection result).
From the description of the above embodiments, it will be apparent to those skilled in the art that the above embodiments may be implemented in software, or may be implemented by means of software plus a necessary general hardware platform. With such understanding, the technical solutions of the foregoing embodiments may be embodied in a software product, where the software product may be stored in a nonvolatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.), and include several instructions for causing a computer device (may be a personal computer, a server, or a network device, etc.) to perform the methods of the embodiments of the present invention.
Example two
The invention also provides a weak supervision action detection system, which is mainly realized based on the method provided by the foregoing embodiment, as shown in fig. 3, and mainly comprises:
an action detection network construction and training unit, configured to construct an action detection network comprising a class-aware action detection branch and a class-independent action detection branch, and to train the action detection network; during training, inputting optical flow features and RGB features of a training video, obtaining a class activation sequence through the class-aware action detection branch and generating class-aware action scores, obtaining optical flow feature action scores and RGB feature action scores through the class-independent action detection branch and fusing them into class-independent action scores; dividing the video frames in the training video into types according to the class-aware action scores and the class-independent action scores, constructing corresponding positive and negative samples according to the type of each frame, and calculating contrast losses using the embedded features respectively generated in the class-aware action detection branch and the class-independent action detection branch, so as to obtain a total contrast loss; calculating a class-aware loss according to the class activation sequence, and calculating a class-independent loss according to the optical flow feature action scores and the RGB feature action scores; applying consistency constraints within the class-independent action detection branch and between the class-aware action detection branch and the class-independent action detection branch respectively, and calculating the corresponding consistency losses; constructing a target loss function by combining the total contrast loss, the class-aware loss, the class-independent loss and the consistency losses, and training the action detection network using the target loss function; wherein RGB represents the three color channels of red, green and blue;
an action detection unit, configured to input the optical flow features and RGB features of a video to be detected into the trained action detection network, and to generate an action detection result by combining the class activation sequence of the class-aware action detection branch with the optical flow feature action scores and RGB feature action scores of the class-independent action detection branch.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional modules is illustrated, and in practical application, the above-described functional allocation may be performed by different functional modules according to needs, i.e. the internal structure of the system is divided into different functional modules to perform all or part of the functions described above.
Example III
The present invention also provides a processing apparatus, as shown in fig. 4, which mainly includes: one or more processors; a memory for storing one or more programs; wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the methods provided by the foregoing embodiments.
Further, the processing device further comprises at least one input device and at least one output device; in the processing device, the processor, the memory, the input device and the output device are connected through buses.
In the embodiment of the invention, the specific types of the memory, the input device and the output device are not limited; for example:
the input device can be a touch screen, an image acquisition device, a physical key or a mouse and the like;
the output device may be a display terminal;
the memory may be random access memory (Random Access Memory, RAM) or non-volatile memory (non-volatile memory), such as disk memory.
Example IV
The invention also provides a readable storage medium storing a computer program which, when executed by a processor, implements the method provided by the foregoing embodiments.
The readable storage medium according to the embodiment of the present invention may be provided as a computer readable storage medium in the aforementioned processing apparatus, for example, as a memory in the processing apparatus. The readable storage medium may be any of various media capable of storing a program code, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a magnetic disk, and an optical disk.
The foregoing is only a preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions easily contemplated by those skilled in the art within the scope of the present invention should be included in the scope of the present invention. Therefore, the protection scope of the present invention should be subject to the protection scope of the claims.

Claims (10)

1. A weakly supervised action detection method, comprising:
constructing an action detection network comprising a class-aware action detection branch and a class-independent action detection branch, and training the action detection network; during training, inputting optical flow features and RGB features of a training video, obtaining a class activation sequence through the class-aware action detection branch and generating class-aware action scores, obtaining optical flow feature action scores and RGB feature action scores through the class-independent action detection branch and fusing them into class-independent action scores; dividing the video frames in the training video into types according to the class-aware action scores and the class-independent action scores, constructing corresponding positive and negative samples according to the type of each frame, and calculating contrast losses using the embedded features respectively generated in the class-aware action detection branch and the class-independent action detection branch, so as to obtain a total contrast loss; calculating a class-aware loss according to the class activation sequence, and calculating a class-independent loss according to the optical flow feature action scores and the RGB feature action scores; applying consistency constraints within the class-independent action detection branch and between the class-aware action detection branch and the class-independent action detection branch respectively, and calculating the corresponding consistency losses; constructing a target loss function by combining the total contrast loss, the class-aware loss, the class-independent loss and the consistency losses, and training the action detection network using the target loss function; wherein RGB represents the three color channels of red, green and blue;
inputting the optical flow features and RGB features of a video to be detected into the trained action detection network, and generating an action detection result by combining the class activation sequence of the class-aware action detection branch with the optical flow feature action scores and RGB feature action scores of the class-independent action detection branch.
2. A method of weakly supervised action detection as set forth in claim 1, wherein,
the inputting optical flow features and RGB features of the training video, obtaining a class activation sequence through the class-aware action detection branch, and generating the class-aware action score comprises the following steps: splicing the optical flow features with the RGB features, obtaining comprehensive embedded features through a feature embedding layer in the class-aware action detection branch, obtaining a class activation sequence CAS through a classification layer, summing the class activation sequence CAS over the class dimension, and applying an activation function to obtain a class-aware action score A;
the obtaining the optical flow feature action score and the RGB feature action score through the class-independent action detection branch comprises the following steps: the class-independent action detection branch comprises two sub-branches, and the optical flow features and the RGB features are each input into their own sub-branch; after the optical flow features and the RGB features are input into the corresponding sub-branches, the corresponding embedded features are obtained through the embedding layers in the sub-branches, and the action scores corresponding to the respective features are then obtained through the classification layers and the activation functions.
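For illustration only: a minimal PyTorch sketch of the dual-branch layout described in claim 2, with a feature-embedding layer and classification layer per branch. The Conv1d layers, channel sizes and module names are assumptions for the example, not the claimed network definition.

```python
import torch
import torch.nn as nn

class DualBranchDetector(nn.Module):
    def __init__(self, feat_dim=1024, embed_dim=512, num_classes=20):
        super().__init__()
        # class-aware branch: concatenated RGB+flow features -> embedding -> CAS
        self.embed_ca = nn.Conv1d(2 * feat_dim, embed_dim, kernel_size=3, padding=1)
        self.cls_head = nn.Conv1d(embed_dim, num_classes, kernel_size=1)
        # class-independent branch: one sub-branch per modality -> frame-level action score
        self.embed_rgb = nn.Conv1d(feat_dim, embed_dim, kernel_size=3, padding=1)
        self.embed_flow = nn.Conv1d(feat_dim, embed_dim, kernel_size=3, padding=1)
        self.act_rgb = nn.Conv1d(embed_dim, 1, kernel_size=1)
        self.act_flow = nn.Conv1d(embed_dim, 1, kernel_size=1)

    def forward(self, rgb, flow):                        # rgb, flow: (B, T, feat_dim)
        rgb, flow = rgb.transpose(1, 2), flow.transpose(1, 2)
        x_ca = torch.relu(self.embed_ca(torch.cat([rgb, flow], dim=1)))   # comprehensive embedding
        cas = self.cls_head(x_ca).transpose(1, 2)        # (B, T, C) class activation sequence
        a = torch.sigmoid(cas.sum(dim=2))                # (B, T) class-aware action score
        x_r = torch.relu(self.embed_rgb(rgb))            # RGB embedded features
        x_f = torch.relu(self.embed_flow(flow))          # optical flow embedded features
        a_rgb = torch.sigmoid(self.act_rgb(x_r)).squeeze(1)    # (B, T) RGB action score
        a_flow = torch.sigmoid(self.act_flow(x_f)).squeeze(1)  # (B, T) flow action score
        return cas, a, a_rgb, a_flow, x_ca, x_r, x_f
```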
3. A method of weakly supervised action detection as claimed in claim 1 or 2, wherein said calculating class awareness losses from class activation sequences comprises:
the scores of the K1 video frames with the highest scores of each class in the class activation sequence CAS are aggregated to represent the score of each class of the training video; wherein K1 is a preset positive integer;
the cross entropy loss is calculated as a class-aware loss using a given video level tag, expressed as:
L_{cls} = -\sum_{c=1}^{C} y_c \log p_c

wherein L_{cls} represents the class-aware loss, C is the total number of classes, p_c is the score of the c-th class of the training video, and y_c is the ground-truth label of the c-th class of the training video, which belongs to the given video-level label.
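For illustration only: a hedged sketch of the class-aware loss of claim 3, i.e. top-K1 temporal pooling of the CAS followed by a video-level cross entropy. The softmax over classes and the normalization of the multi-hot label are assumptions about details the claim does not spell out.

```python
import torch

def class_aware_loss(cas, video_labels, k1=8):
    # cas: (B, T, C) class activation sequence; video_labels: (B, C) float multi-hot labels
    topk = torch.topk(cas, k=min(k1, cas.shape[1]), dim=1).values   # (B, k1, C) top-K1 frames per class
    video_logits = topk.mean(dim=1)                                  # aggregated per-class video score
    p = torch.softmax(video_logits, dim=1)                           # p_c
    y = video_labels / video_labels.sum(dim=1, keepdim=True).clamp(min=1e-8)
    return -(y * torch.log(p + 1e-8)).sum(dim=1).mean()              # cross entropy L_cls
```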
4. The method of claim 1, wherein calculating class independent losses from the optical flow feature action score and the RGB feature action score comprises:
finding out the K1 video frames with the highest scores in the comprehensive class activation sequence according to the action classes in the given video label, and marking the found K1 video frames as an action set T_a; the remaining video frames are marked as a background set T_b; the RGB feature action scores and the optical flow feature action scores of all video frames output by the class-independent action detection branch respectively form corresponding action sequences, namely a first action sequence and a second action sequence, and the class activation sequence of the class-aware action detection branch is normalized over the class dimension and then fused with the first action sequence and the second action sequence to obtain the comprehensive class activation sequence; K1 is a preset positive integer;
And then, respectively calculating corresponding generalized cross entropy loss by utilizing the optical flow characteristic action score and the RGB characteristic action score, wherein the generalized cross entropy loss is expressed as:
L_{GCE}^{*} = \frac{1}{|T_a|} \sum_{t \in T_a} \frac{1 - (A_t^{*})^{q}}{q} + \frac{1}{|T_b|} \sum_{t' \in T_b} \frac{1 - (1 - A_{t'}^{*})^{q}}{q}, \quad * \in \{R, F\}

wherein R corresponds to the RGB features, F corresponds to the optical flow features, q is a noise tolerance coefficient with 0 < q ≤ 1, L_{GCE}^{*} represents the generalized cross entropy loss calculated using the RGB feature action scores or the optical flow feature action scores, A_t^{*} represents the RGB feature action score or optical flow feature action score of the t-th video frame in the action set T_a, A_{t'}^{*} represents the RGB feature action score or optical flow feature action score of the t'-th video frame in the background set T_b, and |T_a| and |T_b| represent the numbers of video frames in the action set T_a and the background set T_b respectively;
and adding the generalized cross entropy loss calculated using the RGB feature action scores and the generalized cross entropy loss calculated using the optical flow feature action scores to obtain the class-independent loss.
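For illustration only: a hedged sketch of the generalized cross entropy loss of claim 4 for a single modality; the action set T_a and background set T_b are assumed to be boolean masks over frames derived from the comprehensive class activation sequence, and q is the noise tolerance coefficient.

```python
import torch

def generalized_ce_loss(action_score, action_mask, background_mask, q=0.7):
    # action_score: (T,) RGB or optical flow frame-level action score
    a = action_score[action_mask]            # frames in the action set T_a
    b = action_score[background_mask]        # frames in the background set T_b
    loss_a = ((1.0 - a.clamp(min=1e-8) ** q) / q).mean() if a.numel() else a.sum()
    loss_b = ((1.0 - (1.0 - b).clamp(min=1e-8) ** q) / q).mean() if b.numel() else b.sum()
    return loss_a + loss_b

# class-independent loss (claim 4): generalized_ce_loss on the RGB scores plus
# generalized_ce_loss on the optical flow scores, using the same frame sets.
```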
5. The method of claim 1, wherein dividing the video frames in the training video according to the class-aware action score and the class-independent action score comprises:
for each branch, taking the median of all prediction scores in the corresponding branch, preliminarily marking video frames scoring above the median as action and video frames scoring below the median as background, thereby obtaining the preliminary action/background predictions of the two branches, wherein the prediction scores are the class-aware action scores for the class-aware action detection branch and the class-independent action scores for the class-independent action detection branch; comparing the preliminary action/background predictions of the two branches: video frames preliminarily marked as action by both branches are divided into consistent action frames, and video frames preliminarily marked as background by both branches are divided into consistent background frames; for the video frames whose preliminary marks from the two branches are inconsistent, sorting them according to the action scores, dividing the K2 video frames with the highest scores into inconsistent action frames and the K2 video frames with the lowest scores into inconsistent background frames; wherein K2 is a preset positive integer;
Four different positive and negative sample pairs are constructed for four different frame types: for the consistent action frame, the positive sample is an inconsistent action frame, and the negative sample is a consistent background frame; for the consistent background frames, positive samples are inconsistent background frames, and negative samples are consistent action frames; for inconsistent action frames, positive samples are consistent action frames, and negative samples are consistent background frames; for inconsistent background frames, the positive samples are consistent background frames, and the negative samples are consistent action frames.
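For illustration only: a hedged sketch of the frame partition of claim 5, using per-branch median thresholding, an agreement check between the two branches, and top/bottom-K2 selection among the disagreeing frames. Sorting the disagreeing frames by the average of the two branch scores is an assumption; the claim only says they are sorted by action score.

```python
import torch

def partition_frames(a_aware, a_free, k2=8):
    # a_aware: (T,) class-aware action score; a_free: (T,) class-independent action score
    pred1 = a_aware > a_aware.median()
    pred2 = a_free > a_free.median()
    consistent_action = pred1 & pred2                    # both branches say "action"
    consistent_background = (~pred1) & (~pred2)          # both branches say "background"
    inconsistent_action = torch.zeros_like(pred1)
    inconsistent_background = torch.zeros_like(pred1)
    idx = torch.nonzero(pred1 ^ pred2).flatten()         # frames the branches disagree on
    if idx.numel():
        score = 0.5 * (a_aware + a_free)                 # assumed sorting score
        order = idx[torch.argsort(score[idx], descending=True)]
        k = min(k2, idx.numel())
        inconsistent_action[order[:k]] = True            # highest-scoring disagreements
        inconsistent_background[order[-k:]] = True       # lowest-scoring disagreements
    return consistent_action, consistent_background, inconsistent_action, inconsistent_background
```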
6. The method of claim 5, wherein calculating the contrast losses using the embedded features respectively generated in the class-aware action detection branch and the class-independent action detection branch to obtain the total contrast loss comprises:
each frame type is used as an anchor sample, and the contrast loss is calculated by combining the corresponding positive and negative samples, and is expressed as follows:
L_{con} = -\log \frac{\exp(x^{\mathsf{T}} \bar{x}^{+} / \tau)}{\exp(x^{\mathsf{T}} \bar{x}^{+} / \tau) + \sum_{x_i^{-} \in M} \exp(x^{\mathsf{T}} x_i^{-} / \tau)}

wherein x represents the embedded feature of the anchor sample, T is the transpose symbol, X represents the set of embedded features of all video frames and x ∈ X, L_{con} denotes the contrast loss, \bar{x}^{+} represents the mean of the embedded features of all positive samples, x_i^{-} represents the embedded feature of the i-th negative sample, M is the embedded feature set of the negative samples, and τ is the temperature parameter used to control the penalty strength of the contrastive learning; the embedded features include: the comprehensive embedded features in the class-aware action detection branch, and the RGB embedded features and optical flow embedded features in the class-independent action detection branch;
when the embedded features are the comprehensive embedded features, respectively taking the four different frame types as anchor samples, calculating the corresponding contrast losses, and combining them as a first contrast loss; when the embedded features are the RGB embedded features, respectively taking the four different frame types as anchor samples, calculating the corresponding contrast losses, and combining them as a second contrast loss; when the embedded features are the optical flow embedded features, respectively taking the four different frame types as anchor samples, calculating the corresponding contrast losses, and combining them as a third contrast loss; the sum of the first contrast loss, the second contrast loss and the third contrast loss is the total contrast loss.
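For illustration only: a hedged sketch of the contrastive loss of claim 6 for one type of embedded features, with the anchor compared against the mean positive embedding and the set of negative embeddings under a temperature τ. The L2 normalization of the embeddings is an extra assumption for numerical stability.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(embeddings, anchor_mask, positive_mask, negative_mask, tau=0.1):
    # embeddings: (T, D) frame embeddings of one feature type (comprehensive, RGB or flow)
    z = F.normalize(embeddings, dim=1)
    anchors = z[anchor_mask]                      # (Na, D) embedded features of the anchor frames
    pos_mean = z[positive_mask].mean(dim=0)       # mean embedded feature of all positive samples
    negs = z[negative_mask]                       # (Nn, D) embedded features of the negative samples
    pos_sim = anchors @ pos_mean / tau            # (Na,)
    neg_sim = anchors @ negs.t() / tau            # (Na, Nn)
    denom = pos_sim.exp() + neg_sim.exp().sum(dim=1)
    return -(pos_sim - denom.log()).mean()

# Total contrast loss: repeat for the comprehensive, RGB and flow embeddings, each time
# taking the four frame types of claim 5 in turn as anchors, and sum the results.
```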
7. The method of claim 1, wherein respectively applying consistency constraints between the class-aware action detection branch and the class-independent action detection branch and inside the class-independent action detection branch, and calculating the corresponding consistency losses, comprises:
Applying action consistency constraint between class-aware action detection branches and class-independent action detection branches, and calculating action consistency loss
L_{act}, expressed as:

L_{act} = MSE(A^{ca}, sg(A))

wherein MSE(·,·) represents the mean square error loss function, sg(·) represents stopping gradient back-propagation, A^{ca} represents the class-independent action score, and A represents the class-aware action score;
applying multi-modal consistency constraints to the interior of the class-independent action detection branch, and calculating multi-modal consistency loss
L_{mm}, expressed as:

L_{mm} = MSE(A^{R}, A^{F})

wherein A^{R} represents the RGB feature action score and A^{F} represents the optical flow feature action score.
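For illustration only: a hedged sketch of the two consistency losses of claim 7. Which side of the action consistency term has its gradient stopped is an assumption; here the class-aware score A is detached so that it acts as a fixed target for the class-independent branch.

```python
import torch.nn.functional as F

def action_consistency_loss(a_free, a_aware):
    # a_free: class-independent action score A^ca; a_aware: class-aware action score A
    return F.mse_loss(a_free, a_aware.detach())       # MSE with gradient stopped on A (assumed)

def multimodal_consistency_loss(a_rgb, a_flow):
    # enforced inside the class-independent branch, between the two modalities
    return F.mse_loss(a_rgb, a_flow)
```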
8. A weakly supervised action detection system, characterized in that it is implemented based on the method of any one of claims 1-7, the system comprising:
the system comprises an action detection network construction and training unit, a classification sensing unit and a classification independent action detection unit, wherein the action detection network construction and training unit is used for constructing an action detection network comprising a classification sensing action detection branch and a classification independent action detection branch and training the action detection network; during training, inputting optical flow characteristics and RGB characteristics of a training video, obtaining a class activation sequence through a class perception action detection branch, generating class perception action scores, obtaining optical flow characteristic action scores and RGB characteristic action scores through a class irrelevant action detection branch, and fusing the optical flow characteristic action scores and the RGB characteristic action scores into class irrelevant action scores; dividing video frames in the training video according to the category perception action score and the category irrelevant action score, constructing corresponding positive and negative samples according to the type of each frame, and calculating comparison loss by utilizing embedded features respectively generated in the category perception action detection branch and the category irrelevant action detection branch to obtain total comparison loss; calculating class perception losses according to the class activation sequences, and calculating class irrelevant losses according to the optical flow characteristic action scores and the RGB characteristic action scores; respectively applying consistency constraint to the interior of the class-independent action detection branch and between the class-aware action detection branch and the class-independent action detection branch, and calculating corresponding consistency loss; constructing a target loss function by combining the total contrast loss, the category perception loss, the category irrelevant loss and the consistency loss, and training an action detection network by utilizing the target loss function; wherein RGB represents three color channels of red, green and blue;
the action detection unit is used for inputting the optical flow features and RGB features of the video to be detected into the trained action detection network, and generating an action detection result by combining the class activation sequence of the class-aware action detection branch with the optical flow feature action score and RGB feature action score of the class-independent action detection branch.
9. A processing apparatus, comprising: one or more processors; a memory for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-7.
10. A readable storage medium storing a computer program, which when executed by a processor implements the method of any one of claims 1-7.
CN202310326090.2A 2023-03-30 2023-03-30 Weak supervision action detection method, system, equipment and storage medium Active CN116030538B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310326090.2A CN116030538B (en) 2023-03-30 2023-03-30 Weak supervision action detection method, system, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN116030538A true CN116030538A (en) 2023-04-28
CN116030538B CN116030538B (en) 2023-06-16

Family

ID=86089646

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310326090.2A Active CN116030538B (en) 2023-03-30 2023-03-30 Weak supervision action detection method, system, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116030538B (en)

Patent Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200272823A1 (en) * 2017-11-14 2020-08-27 Google Llc Weakly-Supervised Action Localization by Sparse Temporal Pooling Network
US20190228313A1 (en) * 2018-01-23 2019-07-25 Insurance Services Office, Inc. Computer Vision Systems and Methods for Unsupervised Representation Learning by Sorting Sequences
DE102020122028A1 (en) * 2019-08-27 2021-03-04 Nvidia Corporation SELF-MONITORED HIERARCHICAL MOTION LEARNING FOR VIDEO ACTION DETECTION
CN111079646A (en) * 2019-12-16 2020-04-28 中山大学 Method and system for positioning weak surveillance video time sequence action based on deep learning
US20210357687A1 (en) * 2020-05-12 2021-11-18 Salesforce.Com, Inc. Systems and methods for partially supervised online action detection in untrimmed videos
CN111914644A (en) * 2020-06-30 2020-11-10 西安交通大学 Dual-mode cooperation based weak supervision time sequence action positioning method and system
US20220189209A1 (en) * 2020-07-07 2022-06-16 Nanjing University Of Science And Technology Weakly supervised video activity detection method and system based on iterative learning
CN111783713A (en) * 2020-07-09 2020-10-16 中国科学院自动化研究所 Weak supervision time sequence behavior positioning method and device based on relation prototype network
WO2022012407A1 (en) * 2020-07-15 2022-01-20 华为技术有限公司 Neural network training method and related device
CN111914778A (en) * 2020-08-07 2020-11-10 重庆大学 Video behavior positioning method based on weak supervised learning
WO2022152104A1 (en) * 2021-01-15 2022-07-21 百果园技术(新加坡)有限公司 Action recognition model training method and device, and action recognition method and device
CN114049581A (en) * 2021-09-27 2022-02-15 中国科学院信息工程研究所 Weak supervision behavior positioning method and device based on action fragment sequencing
CN114494941A (en) * 2021-12-27 2022-05-13 天津大学 Comparison learning-based weak supervision time sequence action positioning method
CN114821772A (en) * 2022-04-13 2022-07-29 常州机电职业技术学院 Weak supervision time sequence action detection method based on time-space correlation learning
CN114648723A (en) * 2022-04-28 2022-06-21 之江实验室 Action normative detection method and device based on time consistency comparison learning
CN114842402A (en) * 2022-05-26 2022-08-02 重庆大学 Weakly supervised time sequence behavior positioning method based on counterstudy
CN115641529A (en) * 2022-09-30 2023-01-24 青岛科技大学 Weak supervision time sequence behavior detection method based on context modeling and background suppression

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
QINYING LIU, ET.AL: "Convex combination consistency between neighbors for weakly-supervised action localization", 《ARXIV:2205.00400V1》, pages 1 - 7 *
SANATH NARAYAN, ET.AL: "3C-Net: category count and center loss for weakly-supervised action localization", 《PROCEEDINGS OF THE IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION》, pages 8679 - 8687 *
TAN YU, ET.AL: "Temporal structure mining for weakly supervised action detection", 《PROCEEDINGS OF THE IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION》, pages 5522 - 5531 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116665089A (en) * 2023-05-08 2023-08-29 广州大学 Depth fake video detection method based on three-dimensional space-time network
CN116665089B (en) * 2023-05-08 2024-03-22 广州大学 Depth fake video detection method based on three-dimensional space-time network
CN116612420A (en) * 2023-07-20 2023-08-18 中国科学技术大学 Weak supervision video time sequence action detection method, system, equipment and storage medium
CN116612420B (en) * 2023-07-20 2023-11-28 中国科学技术大学 Weak supervision video time sequence action detection method, system, equipment and storage medium

Also Published As

Publication number Publication date
CN116030538B (en) 2023-06-16

Similar Documents

Publication Publication Date Title
CN116030538B (en) Weak supervision action detection method, system, equipment and storage medium
CN109670528B (en) Data expansion method facing pedestrian re-identification task and based on paired sample random occlusion strategy
Li et al. Spatio-temporal unity networking for video anomaly detection
CN110008962B (en) Weak supervision semantic segmentation method based on attention mechanism
CN111382686B (en) Lane line detection method based on semi-supervised generation confrontation network
US11640714B2 (en) Video panoptic segmentation
CN110378911B (en) Weak supervision image semantic segmentation method based on candidate region and neighborhood classifier
CN111008643B (en) Picture classification method and device based on semi-supervised learning and computer equipment
CN108198172B (en) Image significance detection method and device
CN103106394A (en) Human body action recognition method in video surveillance
CN110826702A (en) Abnormal event detection method for multitask deep network
CN112489092A (en) Fine-grained industrial motion mode classification method, storage medium, equipment and device
CN110163060B (en) Method for determining crowd density in image and electronic equipment
CN111027347A (en) Video identification method and device and computer equipment
CN115830449A (en) Remote sensing target detection method with explicit contour guidance and spatial variation context enhancement
CN116109812A (en) Target detection method based on non-maximum suppression threshold optimization
CN116151319A (en) Method and device for searching neural network integration model and electronic equipment
CN111539390A (en) Small target image identification method, equipment and system based on Yolov3
CN113743572A (en) Artificial neural network testing method based on fuzzy
US20230260259A1 (en) Method and device for training a neural network
CN114708472B (en) AI (Artificial intelligence) training-oriented multi-modal data set labeling method and device and electronic equipment
CN116206201A (en) Monitoring target detection and identification method, device, equipment and storage medium
CN114494999B (en) Double-branch combined target intensive prediction method and system
CN106682581B (en) Luggage identification method and equipment
CN116612420B (en) Weak supervision video time sequence action detection method, system, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant