CN111783713B - Weakly supervised temporal behavior localization method and device based on a relation prototype network - Google Patents

Weakly supervised temporal behavior localization method and device based on a relation prototype network

Info

Publication number
CN111783713B
Authority
CN
China
Prior art keywords
behavior
probability
human
optical flow
positioning model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010659078.XA
Other languages
Chinese (zh)
Other versions
CN111783713A (en)
Inventor
王亮 (Wang Liang)
黄岩 (Huang Yan)
黄林江 (Huang Linjiang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science
Priority to CN202010659078.XA
Publication of CN111783713A
Application granted
Publication of CN111783713B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 - Movements or behaviour, e.g. gesture recognition
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G06N 3/084 - Backpropagation, e.g. using gradient descent
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 - Scenes; Scene-specific elements
    • G06V 20/40 - Scenes; Scene-specific elements in video content
    • G06V 20/46 - Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Molecular Biology (AREA)
  • Multimedia (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a weakly supervised temporal behavior localization method and device based on a relation prototype network. To solve the problems of the prior art, in which training a network model on manually annotated information is time-consuming and labor-intensive and introduces subjective factors, the invention provides a weakly supervised temporal behavior localization method based on a relation prototype network, which comprises: dividing a video to be recognized into a plurality of video segments at a preset time interval, and inputting the optical flow image corresponding to each video segment, together with the video segments, into a pre-trained behavior localization model; determining, through the behavior localization model, a first similarity between the human behavior in each video segment and a preset target behavior; and determining the behavior category to which the human behavior in each video segment belongs according to the comparison of the first similarity with a preset threshold. The method can model the relations between different behaviors and, through a clustering loss, makes the features of each part of a behavior as close together as possible, thereby localizing complete behavior segments.

Description

Weakly supervised temporal behavior localization method and device based on a relation prototype network
Technical Field
The invention relates to the technical field of computer vision, and in particular to a weakly supervised temporal behavior localization method and device based on a relation prototype network.
Background
Behavior localization is an important and challenging classical computer vision task with wide applications in fields such as security monitoring, intelligent video analysis, and video retrieval.
Most existing behavior localization methods are fully supervised: they train the model with frame-by-frame annotations as supervision information. However, frame-by-frame annotation is time-consuming and labor-intensive, and manual annotation usually introduces subjective factors.
How to solve these problems of the prior art is therefore a technical problem to be addressed by those skilled in the art.
Disclosure of Invention
To solve the above problems in the prior art, namely the time and labor cost of training a network model on manually annotated information and the subjective factors such annotation introduces, a first aspect of the present invention provides a weakly supervised temporal behavior localization method based on a relation prototype network, the method comprising:
dividing a video to be recognized into a plurality of video segments according to a preset time interval, inputting an optical flow image corresponding to each video segment and the plurality of video segments into a pre-trained behavior positioning model, wherein the behavior positioning model is a model constructed based on a deep convolutional neural network, and performing video behavior positioning based on a preset training sample;
determining a first similarity between the human behavior in each video clip and a preset target behavior through the behavior positioning model;
and determining the behavior category to which the human behavior in each video clip belongs according to the comparison result of the first similarity and a preset threshold.
In a possible implementation manner, the behavior localization model includes multiple graph convolution layers and a pooling layer, and the step of "determining a first similarity between the human behavior in each video clip and a preset target behavior through the behavior localization model" includes:
acquiring first output features corresponding to human behaviors in a plurality of optical flow images through a pooling layer of the behavior positioning model;
based on preset category features and a preset co-occurrence matrix corresponding to a plurality of target behaviors, acquiring second output features corresponding to the behavior prototypes of the human behaviors through a graph convolution layer of the behavior positioning model;
based on the first output feature and the second output feature, acquiring Euclidean distances of the first output feature and the second output feature through the behavior positioning model;
and determining a first similarity between the human behavior in each video clip and a preset target behavior according to the Euclidean distance.
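By way of illustration, the following minimal Python (PyTorch) sketch shows one way such a negative-Euclidean-distance similarity could be computed; the tensor names and shapes are assumptions of this sketch, not part of the disclosure.

```python
import torch

def prototype_similarity(segment_features: torch.Tensor,
                         prototypes: torch.Tensor) -> torch.Tensor:
    """Similarity of each video segment to each behavior prototype,
    defined as the negative Euclidean distance.

    segment_features: (T, D) embedded features of the T video segments
    prototypes:       (C, D) behavior prototypes, one per behavior class
    returns:          (T, C) per-segment, per-class similarity
    """
    distances = torch.cdist(segment_features, prototypes)  # pairwise L2 distances
    return -distances  # larger similarity means smaller distance
```

With, say, T = 100 segments and C = 20 behavior classes, the function returns a 100×20 similarity map whose row-wise maxima indicate the most likely behavior of each segment.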
In a possible implementation manner, before the step of inputting the optical flow image corresponding to each video segment and the plurality of video segments into a pre-trained behavior localization model, the method further includes training the behavior localization model, and the method for training the behavior localization model includes:
converting the first output characteristic into a third output characteristic through a multilayer convolution network of a behavior positioning model to be trained according to the first output characteristic;
according to the third output characteristic, acquiring attention weight through a multilayer convolution network of the behavior positioning model to be trained;
carrying out weighted summation on the Euclidean distance and the attention weight to obtain a second similarity of the human behavior in the optical flow image and a preset target behavior;
according to the second similarity, acquiring the probability that the human behavior in the optical flow image belongs to a preset target behavior category through a soft-max layer of the behavior positioning model to be trained;
and training the behavior positioning model through a back propagation algorithm and a random gradient descent algorithm according to the probability that the human behavior in the optical flow image belongs to the preset target behavior category.
In a possible implementation manner, the loss function corresponding to the second similarity includes:
$$\mathcal{L}_{cluster} = -\sum_{i=1}^{C}\delta(y_i=1)\,\hat{s}_i$$

where $\mathcal{L}_{cluster}$ represents the value of the loss function corresponding to the second similarity, $\delta(\cdot)$ represents a conditional function that is 1 when the condition is satisfied and 0 otherwise, $y_i$ represents the probability corresponding to behavior class $i$, $\hat{s}_i$ represents the $i$-th value of the second similarity, and $C$ is the number of behavior categories.
In a possible implementation manner, after the step of "obtaining, according to the second similarity, a probability that the human behavior in the optical flow image belongs to a preset target behavior category through a soft-max layer of the behavior localization model to be trained", the method further includes:
obtaining the deviation of the first probability and the second probability according to the following formula:
$$\mathcal{L}_{cls} = -\sum_{i=1}^{C}\hat{y}_i\log(p_i)$$

where $C$ represents the number of behavior categories, $\hat{y}_i$ represents the normalized true probability corresponding to behavior class $i$, and $p_i$ represents the predicted probability corresponding to behavior class $i$; the first probability represents the probability that the human behavior in the optical flow image belongs to a preset target behavior category, and the second probability represents the probability of the behavior category to which the human behavior in the optical flow image actually belongs.
Preferably, after the step of "obtaining the first probability", and before the step of "training the behavior localization model by using a back propagation algorithm and a stochastic gradient descent algorithm", the method further includes obtaining a global loss of the behavior localization model according to a method shown by the following formula:
$$\mathcal{L} = \mathcal{L}_{cls} + \alpha\,\mathcal{L}_{cluster}$$

where $\mathcal{L}$ represents the global loss of the behavior localization model to be trained, $\mathcal{L}_{cls}$ represents the value of the loss function corresponding to the first probability, $\alpha$ represents an adjustment parameter for adjusting the weight ratio of the two loss functions, and $\mathcal{L}_{cluster}$ represents the value of the loss function corresponding to the second similarity.
In another aspect of the present invention, a weakly supervised temporal behavior localization apparatus based on a relation prototype network is further provided, comprising:
the system comprises a first module, a second module and a third module, wherein the first module is used for dividing a video to be recognized into a plurality of video segments according to a preset time interval, inputting an optical flow image corresponding to each video segment and the plurality of video segments into a pre-trained behavior positioning model, the behavior positioning model is a model constructed based on a deep convolutional neural network, and video behavior positioning is carried out based on a preset training sample;
the second module is used for determining the first similarity between the human behavior in each video clip and a preset target behavior through the behavior positioning model;
and the third module is used for determining the behavior category to which the human behavior in each video clip belongs according to the comparison result of the first similarity and a preset threshold.
In one possible implementation, the behavior localization model includes multiple graph convolution layers and a pooling layer, and the second module is further configured to:
acquiring first output features corresponding to human behaviors in a plurality of optical flow images through a pooling layer of the behavior positioning model;
based on preset category features and a preset co-occurrence matrix corresponding to a plurality of target behaviors, acquiring second output features corresponding to the behavior prototypes of the human behaviors in the optical flow image through the graph convolution layers of the behavior positioning model;
based on the first output characteristic and the second output characteristic, acquiring Euclidean distance between the first output characteristic and the second output characteristic through the behavior positioning model;
and determining a first similarity between the human behavior in each video clip and a preset target behavior according to the Euclidean distance.
In one possible implementation manner, the apparatus further includes a fourth module configured to:
converting the first output characteristic into a third output characteristic through a multilayer convolution network of a behavior positioning model to be trained according to the first output characteristic;
according to the third output characteristic, acquiring attention weight through a multilayer convolution network of the behavior positioning model to be trained;
carrying out weighted summation on the Euclidean distance and the attention weight to obtain a second similarity of the human behavior in the optical flow image and a preset target behavior;
according to the second similarity, acquiring the probability that the human behavior in the optical flow image belongs to a preset target behavior category through a soft-max layer of the behavior positioning model to be trained;
and training the behavior positioning model through a back propagation algorithm and a random gradient descent algorithm according to the probability that the human behavior in the optical flow image belongs to the preset target behavior category.
In a possible implementation manner, the loss function corresponding to the second similarity includes:
$$\mathcal{L}_{cluster} = -\sum_{i=1}^{C}\delta(y_i=1)\,\hat{s}_i$$

where $\mathcal{L}_{cluster}$ represents the value of the loss function corresponding to the second similarity, $\delta(\cdot)$ represents a conditional function that is 1 when the condition is satisfied and 0 otherwise, $y_i$ represents the probability corresponding to behavior class $i$, $\hat{s}_i$ represents the $i$-th value of the second similarity, and $C$ is the number of behavior categories.
In one possible implementation manner, the fourth module is further configured to:
obtaining the deviation of the first probability and the second probability according to the following formula:
$$\mathcal{L}_{cls} = -\sum_{i=1}^{C}\hat{y}_i\log(p_i)$$

where $C$ represents the number of behavior categories, $\hat{y}_i$ represents the true probability of behavior class $i$, and $p_i$ represents the predicted probability corresponding to behavior class $i$; the first probability represents the probability that the human behavior in the optical flow image belongs to a preset target behavior category, and the second probability represents the probability that the human behavior in the optical flow image actually belongs to the behavior category.
In another aspect of the present invention, a weakly supervised temporal behavior localization apparatus based on a relation prototype network is further provided, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to execute the relation prototype network-based weakly supervised temporal behavior localization method as described above.
Another aspect of the present invention further provides a non-transitory computer readable storage medium, on which computer program instructions are stored, and when the computer program instructions are executed by a processor, the method for weakly supervised temporal behavior localization based on a relation prototype network as described above is implemented.
The weakly supervised temporal behavior localization method and device based on a relation prototype network of the present invention can model the relations between different behaviors and, through a clustering loss, bring the features of each part of a behavior as close together as possible, thereby localizing complete behavior segments.
Drawings
FIG. 1 is a schematic flowchart of the weakly supervised temporal behavior localization method based on a relation prototype network of the present invention;
FIG. 2 is a schematic structural diagram of the weakly supervised temporal behavior localization apparatus based on a relation prototype network of the present invention.
Detailed Description
In order to make the embodiments, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the embodiments are some, but not all embodiments of the present invention. It should be understood by those skilled in the art that these embodiments are only for explaining the technical principle of the present invention, and are not intended to limit the scope of the present invention.
Fig. 1 exemplarily shows a flowchart of the weakly supervised temporal behavior localization method based on the relationship prototype network of the present invention. As shown in fig. 1, the method includes:
step S101, dividing a video to be recognized into a plurality of video segments according to a preset time interval, and inputting a pre-trained behavior positioning model into an optical flow image corresponding to each video segment and the plurality of video segments.
In one possible implementation manner, the behavior localization model is a model constructed based on a deep convolutional neural network, and video behavior localization is performed based on preset training samples.
Step S102, determining a first similarity between the human behavior in each video clip and a preset target behavior through the behavior positioning model.
Step S103, determining the behavior category to which the human behavior in each video clip belongs according to the comparison result of the first similarity and a preset threshold.
The weakly supervised temporal behavior localization method and device based on a relation prototype network of the present invention can model the relations between different behaviors and, through a clustering loss, bring the features of each part of a behavior as close together as possible, thereby localizing complete behavior segments.
In a possible implementation manner, the behavior localization model includes multiple graph convolution layers and a pooling layer, and the step of "determining a first similarity between the human behavior in each video clip and a preset target behavior through the behavior localization model" includes:
acquiring first output features corresponding to human behaviors in a plurality of optical flow images through a pooling layer of the behavior positioning model;
acquiring a second output feature corresponding to the behavior prototype of the human behavior in the optical flow image through a graph convolution layer of the behavior positioning model, based on preset category features and a preset co-occurrence matrix corresponding to a plurality of target behaviors;
based on the first output feature and the second output feature, acquiring Euclidean distances of the first output feature and the second output feature through the behavior positioning model;
and determining a first similarity between the human behavior in each video clip and a preset target behavior according to the Euclidean distance.
In a possible implementation manner, before the step of inputting the optical flow image corresponding to each video segment and the plurality of video segments into a pre-trained behavior localization model, the method further includes training the behavior localization model, and the method for training the behavior localization model includes:
converting the first output characteristic into a third output characteristic through a multilayer convolution network of a behavior positioning model to be trained according to the first output characteristic;
according to the third output characteristic, acquiring attention weight through a multilayer convolution network of the behavior positioning model to be trained;
carrying out weighted summation on the Euclidean distance and the attention weight to obtain a second similarity of the human behavior in the optical flow image and a preset target behavior;
according to the second similarity, acquiring the probability that the human behavior in the optical flow image belongs to a preset target behavior category through a soft-max layer of the behavior positioning model to be trained;
and training the behavior positioning model through a back propagation algorithm and a random gradient descent algorithm according to the probability that the human behavior in the optical flow image belongs to the preset target behavior category.
In a possible implementation manner, the loss function corresponding to the second similarity includes:
$$\mathcal{L}_{cluster} = -\sum_{i=1}^{C}\delta(y_i=1)\,\hat{s}_i$$

where $\mathcal{L}_{cluster}$ represents the value of the loss function corresponding to the second similarity, $\delta(\cdot)$ represents a conditional function that is 1 when the condition is satisfied and 0 otherwise, $y_i$ represents the probability corresponding to behavior class $i$, $\hat{s}_i$ represents the $i$-th value of the second similarity, and $C$ is the number of behavior categories.
In a possible implementation manner, after the step of "obtaining, according to the second similarity, a probability that the human behavior in the optical flow image belongs to a preset target behavior category through a soft-max layer of the behavior localization model to be trained", the method further includes:
obtaining the deviation of the first probability and the second probability according to the following formula:
$$\mathcal{L}_{cls} = -\sum_{i=1}^{C}\hat{y}_i\log(p_i)$$

where $C$ represents the number of behavior categories, $\hat{y}_i$ represents the true probability of behavior class $i$, and $p_i$ represents the predicted probability corresponding to behavior class $i$; the first probability represents the probability that the human behavior in the optical flow image belongs to a preset target behavior category, and the second probability represents the probability that the human behavior in the optical flow image actually belongs to the behavior category.
The weakly supervised temporal behavior localization method based on a relation prototype network may further comprise the following steps:
step S0, extracting corresponding optical flow images from 412 videos in the data set, dividing an original video and the optical flow images into a plurality of segments at intervals of 16 frames, and respectively inputting the segments into a model with the same structure (the original video is taken as an example in the following steps);
illustratively, take a behavior locator database as an example, which contains 412 uncut videos.
Step S1, normalizing the output data of step S0 to a uniform spatial size (224×224 pixels), then feeding each input segment into a deep convolutional neural network with fixed parameters, which contains multiple three-dimensional convolution layers, and selecting the output features X of the last average pooling layer (Avg Pool) of the convolutional neural network.
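A minimal sketch of this feature-extraction step; the patent does not name the backbone, so torchvision's r3d_18 is used here purely as a stand-in for the fixed-parameter 3D convolutional network.

```python
import torch
from torchvision.models.video import r3d_18

backbone = r3d_18(weights="DEFAULT")  # stand-in for the fixed 3D CNN
backbone.fc = torch.nn.Identity()     # keep the output of the last average pooling layer
backbone.eval()                       # parameters stay fixed
for param in backbone.parameters():
    param.requires_grad = False

def extract_features(clips: torch.Tensor) -> torch.Tensor:
    """clips: (T, 3, 16, 224, 224), i.e. T segments of 16 frames each.
    Returns X: (T, D) pooled per-segment features."""
    with torch.no_grad():
        return backbone(clips)
```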
Step S2, converting the features X acquired in step S1 into a new vector expression X_e (of size T×D) corresponding to each segment, using a 2-layer convolution network (with 1×1 convolution kernels).
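A sketch of this embedding step, assuming D = 512 and treating the 1×1 convolutions as per-segment transforms along the temporal axis (the layer widths are assumptions):

```python
import torch.nn as nn

# Two convolution layers with 1x1 kernels, mapping X (T, 512) to X_e (T, 512).
embed = nn.Sequential(
    nn.Conv1d(512, 512, kernel_size=1),
    nn.ReLU(),
    nn.Conv1d(512, 512, kernel_size=1),
)

def embed_segments(x):           # x: (T, D_in) features from step S1
    x = x.t().unsqueeze(0)       # (1, D_in, T), the layout Conv1d expects
    x_e = embed(x)               # (1, D, T)
    return x_e.squeeze(0).t()    # X_e: (T, D)
```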
Step S3, computing a co-occurrence matrix A of the behaviors from the training set.
Step S4, inputting the feature vectors of all behavior categories together with the co-occurrence matrix A obtained in S3, and obtaining the prototype P corresponding to each behavior using a 2-layer graph convolution.
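The co-occurrence statistics of S3 and the 2-layer graph convolution of S4 could be sketched as follows; the row normalization of A and the exact graph-convolution form are assumptions, since the patent only specifies "a 2-layer graph convolution".

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def co_occurrence_matrix(labels: torch.Tensor) -> torch.Tensor:
    """labels: (N, C) multi-hot video-level labels of the training set.
    A[i, j] counts how often behaviors i and j occur in the same video."""
    A = labels.t() @ labels                             # (C, C) co-occurrence counts
    return A / A.sum(dim=1, keepdim=True).clamp(min=1)  # row-normalize (assumed)

class PrototypeGCN(nn.Module):
    """2-layer graph convolution turning class features into prototypes P."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.w1 = nn.Linear(in_dim, out_dim)
        self.w2 = nn.Linear(out_dim, out_dim)

    def forward(self, class_feats: torch.Tensor, A: torch.Tensor) -> torch.Tensor:
        h = F.relu(self.w1(A @ class_feats))  # propagate along co-occurrence edges
        return self.w2(A @ h)                 # P: (C, D) behavior prototypes
```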
Step S5, using the new vector expressions X_e obtained in S2 and the prototype P of each category obtained in S4, calculating the Euclidean distance between the features of each segment and each category prototype, and taking the negative of the Euclidean distance as the similarity between the feature and the prototype;
Step S6, using the new vector expressions X_e obtained in S2, obtaining temporal attention weights λ_t (of size T×D) through 2 convolution layers;
Step S7, using the attention weights λ_t obtained in S6, performing a weighted summation over the new vector expressions X_e obtained in S2 to obtain the similarity vector $\hat{s}$ of the whole video with respect to each prototype;
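One reading of steps S5 to S7 in code, aggregating the per-segment similarities over time with the attention weights (the attention head's shape and the soft-max over time are assumptions of this sketch):

```python
import torch
import torch.nn as nn

attn = nn.Sequential(                 # two convolution layers producing attention
    nn.Conv1d(512, 256, kernel_size=1),
    nn.ReLU(),
    nn.Conv1d(256, 1, kernel_size=1),
)

def video_similarity(x_e: torch.Tensor, prototypes: torch.Tensor) -> torch.Tensor:
    """x_e: (T, D) segment expressions; prototypes: (C, D).
    Returns the video-level similarity vector s_hat of size (C,)."""
    s = -torch.cdist(x_e, prototypes)               # (T, C) per-segment similarity (S5)
    logits = attn(x_e.t().unsqueeze(0)).view(-1)    # (T,) attention logits (S6)
    lam = torch.softmax(logits, dim=0)              # temporal attention weights
    return (lam.unsqueeze(1) * s).sum(dim=0)        # weighted sum over time (S7)
```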
Step S8, feeding the similarity vector $\hat{s}$ obtained in S7 into a soft-max layer (along the category dimension) to obtain the probability distribution p of the video over the categories, and calculating its deviation from the true categories, yielding the loss
$$\mathcal{L}_{cls} = -\sum_{i=1}^{C}\hat{y}_i\log(p_i)$$
where $C$ represents the number of behavior categories, $\hat{y}_i$ represents the true probability for behavior class $i$, and $p_i$ represents the prediction probability corresponding to behavior class $i$.
Step S9, using the similarity vector $\hat{s}$ from S7, we constrain the model with a clustering loss that makes the behavior expression as close as possible to its corresponding prototype:
$$\mathcal{L}_{cluster} = -\sum_{i=1}^{C}\delta(y_i=1)\,\hat{s}_i$$
where $\delta(\cdot)$ is a conditional expression that is 1 when the condition is satisfied and 0 otherwise, $C$ represents the number of behavior classes, $y_i$ represents the true probability for behavior class $i$, and $\hat{s}_i$ is the $i$-th element of the similarity vector $\hat{s}$.
Step S10, calculating the global loss
$$\mathcal{L} = \mathcal{L}_{cls} + \alpha\,\mathcal{L}_{cluster}$$
where $\alpha$ is a hyperparameter used to adjust the weight ratio between the two losses, typically set to 0.01.
Step S11, training the model with a back propagation algorithm and stochastic gradient descent to reduce the overall prediction error, and obtaining the final behavior positioning model through multiple training iterations, usually 1000 passes over the whole data set, stopping training once the loss no longer decreases.
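A sketch of this training loop, reusing the pieces sketched in the previous steps (with gcn an instance of PrototypeGCN, and class_feats, A and train_loader hypothetical); the optimizer settings are assumptions, since the patent only specifies back-propagation and stochastic gradient descent.

```python
import torch

params = list(embed.parameters()) + list(attn.parameters()) + list(gcn.parameters())
optimizer = torch.optim.SGD(params, lr=1e-3, momentum=0.9)  # assumed settings

for epoch in range(1000):                        # "usually iterating 1000 times"
    for clips, labels in train_loader:           # hypothetical data loader
        x = extract_features(clips)              # S1: fixed 3D CNN features
        x_e = embed_segments(x)                  # S2: segment expressions X_e
        P = gcn(class_feats, A)                  # S4: behavior prototypes
        s_hat = video_similarity(x_e, P)         # S5-S7: video-level similarity
        p = torch.softmax(s_hat, dim=0)          # S8: class probabilities
        loss = global_loss(p, labels, s_hat)     # S10: global loss
        optimizer.zero_grad()
        loss.backward()                          # back-propagation
        optimizer.step()                         # stochastic gradient descent step
```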
Step S12, extracting the corresponding optical flow images for the 212 test videos, normalizing them to a uniform spatial size (224×224 pixels), and feeding them into the trained behavior positioning model to obtain the similarity s of each segment in the video and the class probability p of the video; the similarity s is fed into a soft-max layer (along the class dimension) to obtain the class-normalized similarity $\bar{s}$.
Step S13, updating the learned prototypes P using the prototype updating strategy, as shown in Method 1. Then recomputing the Euclidean distance between the vector expressions $X_e$ of the segments and the updated prototypes, feeding its negative into a soft-max layer (along the time dimension), and multiplying the result with $\bar{s}$ from S12 to obtain the localization score $\tilde{s}$.
Step S14, using the video category probability p obtained in S12 to reject categories with probability below 0.1, and applying a preset threshold ($8\times10^{-6}$ for the original video, $8\times10^{-8}$ for optical flow) to the localization scores $\tilde{s}$ obtained in S13 to segment them and locate the positions of the behaviors in the video. Non-maximum suppression is applied to the results obtained from the original video and the optical flow to remove redundant behavior localization results, yielding the final localization result.
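The thresholding and non-maximum suppression of step S14 could look like the following sketch; merging consecutive above-threshold segments into proposals and scoring a proposal by its mean segment score are assumptions of this sketch.

```python
def threshold_to_proposals(scores, thresh):
    """scores: per-segment localization scores for one class.
    Consecutive segments above the threshold are merged into one
    (start, end, score) proposal scored by its mean segment score."""
    proposals, start = [], None
    for t, s in enumerate(scores):
        if s > thresh and start is None:
            start = t
        elif s <= thresh and start is not None:
            proposals.append((start, t, sum(scores[start:t]) / (t - start)))
            start = None
    if start is not None:
        proposals.append((start, len(scores),
                          sum(scores[start:]) / (len(scores) - start)))
    return proposals

def temporal_nms(proposals, iou_thresh=0.5):
    """Greedy temporal non-maximum suppression over (start, end, score) proposals."""
    kept = []
    for p in sorted(proposals, key=lambda x: -x[2]):
        suppressed = False
        for q in kept:
            inter = max(0, min(p[1], q[1]) - max(p[0], q[0]))
            union = (p[1] - p[0]) + (q[1] - q[0]) - inter
            if union > 0 and inter / union > iou_thresh:
                suppressed = True
                break
        if not suppressed:
            kept.append(p)
    return kept
```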
Referring to fig. 2, fig. 2 exemplarily shows a schematic structural diagram of the weakly supervised temporal behavior localization apparatus based on a relation prototype network of the present invention. As shown in fig. 2, the apparatus provided by the present invention includes:
the system comprises a first module 1, a second module and a third module, wherein the first module is used for dividing a video to be recognized into a plurality of video segments according to a preset time interval, inputting an optical flow image corresponding to each video segment and the plurality of video segments into a pre-trained behavior positioning model, the behavior positioning model is a model constructed based on a deep convolutional neural network, and video behavior positioning is carried out based on a preset training sample;
the second module 2 is used for determining a first similarity between the human behavior in each video clip and a preset target behavior through the behavior localization model;
and a third module 3, configured to determine, according to a comparison result between the first similarity and a preset threshold, a behavior category to which the human behavior in each video segment belongs.
In a possible implementation manner, the behavior localization model includes multiple graph convolution layers and a pooling layer, and the second module 2 is further configured to:
acquiring first output features corresponding to human behaviors in a plurality of optical flow images through a pooling layer of the behavior positioning model;
acquiring a second output feature corresponding to the behavior prototype of the human behavior in the optical flow image through a graph convolution layer of the behavior positioning model, based on preset category features and a preset co-occurrence matrix corresponding to a plurality of target behaviors;
based on the first output characteristic and the second output characteristic, acquiring Euclidean distance between the first output characteristic and the second output characteristic through the behavior positioning model;
and determining a first similarity between the human behavior in each video clip and a preset target behavior according to the Euclidean distance.
In one possible implementation manner, the apparatus further includes a fourth module configured to:
converting the first output characteristic into a third output characteristic through a multilayer convolution network of a behavior positioning model to be trained according to the first output characteristic;
according to the third output characteristic, acquiring attention weight through a multilayer convolution network of the behavior positioning model to be trained;
carrying out weighted summation on the Euclidean distance and the attention weight to obtain a second similarity of the human behavior in the optical flow image and a preset target behavior;
according to the second similarity, acquiring the probability that the human behavior in the optical flow image belongs to a preset target behavior category through a soft-max layer of the behavior positioning model to be trained;
and training the behavior positioning model through a back propagation algorithm and a random gradient descent algorithm according to the probability that the human behavior in the optical flow image belongs to the preset target behavior category.
In a possible implementation manner, the loss function corresponding to the second similarity includes:
$$\mathcal{L}_{cluster} = -\sum_{i=1}^{C}\delta(y_i=1)\,\hat{s}_i$$

where $\mathcal{L}_{cluster}$ represents the value of the loss function corresponding to the second similarity, $\delta(\cdot)$ represents a conditional function that is 1 when the condition is satisfied and 0 otherwise, $y_i$ represents the probability corresponding to behavior class $i$, $\hat{s}_i$ represents the $i$-th value of the second similarity, and $C$ is the number of behavior categories.
In one possible implementation manner, the fourth module is further configured to:
obtaining the deviation of the first probability and the second probability according to the following formula:
$$\mathcal{L}_{cls} = -\sum_{i=1}^{C}\hat{y}_i\log(p_i)$$

where $C$ represents the number of behavior categories, $\hat{y}_i$ represents the true probability of behavior class $i$, and $p_i$ represents the predicted probability corresponding to behavior class $i$; the first probability represents the probability that the human behavior in the optical flow image belongs to a preset target behavior category, and the second probability represents the probability that the human behavior in the optical flow image actually belongs to the behavior category.
Another aspect of the present application further provides a weakly supervised temporal behavior localization apparatus based on a relation prototype network, comprising: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to execute the relation prototype network-based weakly supervised temporal behavior localization method as described above.
Another aspect of the present application further provides a non-transitory computer readable storage medium, on which computer program instructions are stored, and when executed by a processor, the computer program instructions implement the weak supervised temporal behavior localization method based on a relation prototype network described above.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules or units is only one logical division, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be substantially implemented, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program codes, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
In summary, the above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (6)

1. A weakly supervised temporal behavior localization method based on a relation prototype network, characterized by comprising the following steps:
dividing a video to be recognized into a plurality of video segments according to a preset time interval, inputting an optical flow image corresponding to each video segment and the plurality of video segments into a pre-trained behavior positioning model, wherein the behavior positioning model is a model constructed based on a deep convolutional neural network, and performing video behavior positioning based on a preset training sample;
determining a first similarity between the human behavior in each video clip and a preset target behavior through the behavior positioning model; the method comprises the following steps:
acquiring first output features corresponding to human behaviors in a plurality of optical flow images through a pooling layer of the behavior positioning model;
acquiring second output features corresponding to the behavior prototypes of the human behaviors in the optical flow image through graph convolution layers of the behavior positioning model, based on preset category features and preset co-occurrence matrices corresponding to a plurality of target behaviors;
based on the first output feature and the second output feature, acquiring Euclidean distances of the first output feature and the second output feature through the behavior positioning model;
determining a first similarity between the human behavior in each video clip and a preset target behavior according to the Euclidean distance;
determining a behavior category to which the human behavior in each video clip belongs according to a comparison result of the first similarity and a preset threshold;
the behavior positioning method also comprises the step of training the behavior positioning model, and the training method of the pre-trained behavior positioning model comprises the following steps:
converting the first output characteristic into a third output characteristic through a multilayer convolution network of a behavior positioning model to be trained according to the first output characteristic;
according to the third output characteristic, acquiring attention weight through a multilayer convolution network of the behavior positioning model to be trained;
carrying out weighted summation on the Euclidean distance and the attention weight to obtain a second similarity of the human behavior in the optical flow image and a preset target behavior;
according to the second similarity, acquiring the probability that the human behavior in the optical flow image belongs to a preset target behavior category through a soft-max layer of the behavior positioning model to be trained;
and training the behavior positioning model through a back propagation algorithm and a random gradient descent algorithm according to the probability that the human behavior in the optical flow image belongs to the preset target behavior category.
2. The method according to claim 1, wherein after the step of obtaining the probability that the human behavior in the optical flow image belongs to the preset target behavior category through the soft-max layer of the behavior localization model to be trained according to the second similarity, the method further comprises:
obtaining the deviation of the first probability and the second probability according to the following formula:
$$\mathcal{L}_{cls} = -\sum_{i=1}^{C}\hat{y}_i\log(p_i)$$

where $C$ represents the number of behavior categories, $\hat{y}_i$ represents the true probability for behavior class $i$, and $p_i$ represents the prediction probability corresponding to behavior class $i$; the first probability represents the probability that the human behavior in the optical flow image belongs to a preset target behavior category, and the second probability represents the probability that the human behavior in the optical flow image actually belongs to the behavior category.
3. A weakly supervised temporal behavior localization apparatus based on a relation prototype network is characterized by comprising:
the system comprises a first module, a second module and a third module, wherein the first module is used for dividing a video to be recognized into a plurality of video segments according to a preset time interval, inputting an optical flow image corresponding to each video segment and the plurality of video segments into a pre-trained behavior positioning model, the behavior positioning model is a model constructed based on a deep convolutional neural network, and video behavior positioning is carried out based on a preset training sample;
the second module is used for determining the first similarity between the human behavior in each video clip and a preset target behavior through the behavior positioning model; the method specifically comprises the following steps:
acquiring first output features corresponding to human behaviors in a plurality of optical flow images through a pooling layer of the behavior positioning model;
acquiring a second output feature corresponding to the behavior prototype of the human behavior in the optical flow image through a graph convolution layer of the behavior positioning model, based on preset category features and a preset co-occurrence matrix corresponding to a plurality of target behaviors;
based on the first output feature and the second output feature, acquiring Euclidean distances of the first output feature and the second output feature through the behavior positioning model;
determining a first similarity between the human behavior in each video clip and a preset target behavior according to the Euclidean distance;
a third module, configured to determine, according to a comparison result between the first similarity and a preset threshold, a behavior category to which a human behavior in each video segment belongs;
the device also comprises a fourth module for converting the first output characteristic into a third output characteristic through a multilayer convolution network of a behavior positioning model to be trained according to the first output characteristic;
according to the third output characteristic, acquiring attention weight through a multilayer convolution network of the behavior positioning model to be trained;
carrying out weighted summation on the Euclidean distance and the attention weight to obtain a second similarity of the human behavior in the optical flow image and a preset target behavior;
according to the second similarity, acquiring the probability that the human behavior in the optical flow image belongs to a preset target behavior category through a soft-max layer of the behavior positioning model to be trained;
and training the behavior positioning model through a back propagation algorithm and a random gradient descent algorithm according to the probability that the human behavior in the optical flow image belongs to the preset target behavior category.
4. The apparatus of claim 3, wherein the fourth module is further configured to:
obtaining the deviation of the first probability and the second probability according to the following formula:
$$\mathcal{L}_{cls} = -\sum_{i=1}^{C}\hat{y}_i\log(p_i)$$

where $C$ represents the number of behavior categories, $\hat{y}_i$ represents the true probability of behavior class $i$, and $p_i$ represents the predicted probability corresponding to behavior class $i$; the first probability represents the probability that the human behavior in the optical flow image belongs to a preset target behavior category, and the second probability represents the probability that the human behavior in the optical flow image actually belongs to the behavior category.
5. A weakly supervised temporal behavior localization apparatus based on a relation prototype network, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to execute the method for weakly supervised temporal behavior localization based on a relational prototype network according to any of claims 1 or 2.
6. A non-transitory computer readable storage medium having stored thereon computer program instructions, wherein the computer program instructions, when executed by a processor, implement the relationship prototype network-based weakly supervised temporal behavior localization method of any one of claims 1 or 2.
CN202010659078.XA 2020-07-09 2020-07-09 Weakly supervised temporal behavior localization method and device based on a relation prototype network Active CN111783713B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010659078.XA CN111783713B (en) 2020-07-09 2020-07-09 Weakly supervised temporal behavior localization method and device based on a relation prototype network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010659078.XA CN111783713B (en) 2020-07-09 2020-07-09 Weakly supervised temporal behavior localization method and device based on a relation prototype network

Publications (2)

Publication Number Publication Date
CN111783713A CN111783713A (en) 2020-10-16
CN111783713B true CN111783713B (en) 2022-12-02

Family

ID=72758504

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010659078.XA Active CN111783713B (en) 2020-07-09 2020-07-09 Weakly supervised temporal behavior localization method and device based on a relation prototype network

Country Status (1)

Country Link
CN (1) CN111783713B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112487913A (en) * 2020-11-24 2021-03-12 北京市地铁运营有限公司运营四分公司 Labeling method and device based on neural network and electronic equipment
CN112883868B (en) * 2021-02-10 2022-07-15 中国科学技术大学 Training method of weak supervision video motion positioning model based on relational modeling
CN113408605B (en) * 2021-06-16 2023-06-16 西安电子科技大学 Hyperspectral image semi-supervised classification method based on small sample learning
CN114333064B (en) * 2021-12-31 2022-07-26 江南大学 Small sample behavior identification method and system based on multidimensional prototype reconstruction reinforcement learning
CN116030538B (en) * 2023-03-30 2023-06-16 中国科学技术大学 Weak supervision action detection method, system, equipment and storage medium
CN116563953B (en) * 2023-07-07 2023-10-20 中国科学技术大学 Bottom-up weak supervision time sequence action detection method, system, equipment and medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018019126A1 (en) * 2016-07-29 2018-02-01 北京市商汤科技开发有限公司 Video category identification method and device, data processing device and electronic apparatus
CN109034062A (en) * 2018-07-26 2018-12-18 南京邮电大学 A kind of Weakly supervised anomaly detection method based on temporal consistency
CN110516536A (en) * 2019-07-12 2019-11-29 杭州电子科技大学 A kind of Weakly supervised video behavior detection method for activating figure complementary based on timing classification
CN111079646A (en) * 2019-12-16 2020-04-28 中山大学 Method and system for positioning weak surveillance video time sequence action based on deep learning
WO2020119527A1 (en) * 2018-12-11 2020-06-18 中国科学院深圳先进技术研究院 Human action recognition method and apparatus, and terminal device and storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018019126A1 (en) * 2016-07-29 2018-02-01 北京市商汤科技开发有限公司 Video category identification method and device, data processing device and electronic apparatus
CN109034062A (en) * 2018-07-26 2018-12-18 南京邮电大学 A kind of Weakly supervised anomaly detection method based on temporal consistency
WO2020119527A1 (en) * 2018-12-11 2020-06-18 中国科学院深圳先进技术研究院 Human action recognition method and apparatus, and terminal device and storage medium
CN110516536A (en) * 2019-07-12 2019-11-29 杭州电子科技大学 A kind of Weakly supervised video behavior detection method for activating figure complementary based on timing classification
CN111079646A (en) * 2019-12-16 2020-04-28 中山大学 Method and system for positioning weak surveillance video time sequence action based on deep learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Sentiment analysis method based on a weakly supervised pre-trained CNN model; Zhang Yue et al.; Computer Engineering and Applications; 2018-07-01 (No. 13); 33-39 *
Research on key technologies of video classification and detection based on deep learning; Yang Ke; Doctoral Dissertation Full-text Database; 2019-08-07; full text *

Also Published As

Publication number Publication date
CN111783713A (en) 2020-10-16

Similar Documents

Publication Publication Date Title
CN111783713B (en) Weakly supervised temporal behavior localization method and device based on a relation prototype network
US11755911B2 (en) Method and apparatus for training neural network and computer server
CN111523621B (en) Image recognition method and device, computer equipment and storage medium
US10127477B2 (en) Distributed event prediction and machine learning object recognition system
US10521734B2 (en) Machine learning predictive labeling system
EP3292492B1 (en) Predicting likelihoods of conditions being satisfied using recurrent neural networks
EP3570220B1 (en) Information processing method, information processing device, and computer-readable storage medium
CN111052128B (en) Descriptor learning method for detecting and locating objects in video
CN114358188A (en) Feature extraction model processing method, feature extraction model processing device, sample retrieval method, sample retrieval device and computer equipment
US11379685B2 (en) Machine learning classification system
CN112948612B (en) Human body cover generation method and device, electronic equipment and storage medium
CN113065013B (en) Image annotation model training and image annotation method, system, equipment and medium
CN115359074B (en) Image segmentation and training method and device based on hyper-voxel clustering and prototype optimization
CN112148831B (en) Image-text mixed retrieval method and device, storage medium and computer equipment
CN114298122B (en) Data classification method, apparatus, device, storage medium and computer program product
CN116150698B (en) Automatic DRG grouping method and system based on semantic information fusion
Jnawali et al. Automatic classification of radiological report for intracranial hemorrhage
CN112749737A (en) Image classification method and device, electronic equipment and storage medium
WO2023087063A1 (en) Method and system for analysing medical images to generate a medical report
CN110704668B (en) Grid-based collaborative attention VQA method and device
CN111814653B (en) Method, device, equipment and storage medium for detecting abnormal behavior in video
CN113392867A (en) Image identification method and device, computer equipment and storage medium
CN117273105A (en) Module construction method and device for neural network model
CN112507912B (en) Method and device for identifying illegal pictures
CN113221662B (en) Training method and device of face recognition model, storage medium and terminal

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant