CN111783713B - Weak supervision time sequence behavior positioning method and device based on relation prototype network - Google Patents
Weak supervision time sequence behavior positioning method and device based on relation prototype network
- Publication number: CN111783713B
- Application number: CN202010659078.XA
- Authority: CN (China)
- Legal status: Active (the legal status is an assumption and is not a legal conclusion)
Classifications
- G06V40/20—Recognition of biometric, human-related or animal-related patterns in image or video data; Movements or behaviour, e.g. gesture recognition
- G06F18/214—Pattern recognition; Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06N3/045—Neural networks; Architecture, e.g. interconnection topology; Combinations of networks
- G06N3/084—Neural networks; Learning methods; Backpropagation, e.g. using gradient descent
- G06V20/46—Scenes; Scene-specific elements in video content; Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
Abstract
The invention relates to a weakly supervised temporal behavior localization method and device based on a relation prototype network. To solve the problems that training a network model with manually annotated information in the prior art is time-consuming and labor-intensive and introduces subjective factors, the invention provides a weakly supervised temporal behavior localization method based on a relation prototype network, which comprises: dividing a video to be recognized into a plurality of video segments according to a preset time interval, and inputting the optical flow image corresponding to each video segment together with the plurality of video segments into a pre-trained behavior localization model; determining, through the behavior localization model, a first similarity between the human behavior in each video segment and a preset target behavior; and determining the behavior category to which the human behavior in each video segment belongs according to the comparison result of the first similarity and a preset threshold. The method can model the relations between different behaviors and, through a clustering loss, make the features of each part of a behavior as close as possible, thereby localizing complete behavior segments.
Description
Technical Field
The invention relates to the technical field of computer vision, in particular to a weak supervision time sequence behavior positioning method and device based on a relation prototype network.
Background
Behavior localization is an important and challenging classical computer vision task, and has wide application in the fields of security monitoring, intelligent video analysis, video retrieval and the like.
Most existing behavior localization methods adopt a fully supervised mode, using frame-by-frame annotation as supervision information to train the model; however, frame-by-frame annotation is time-consuming and labor-intensive, and manual annotation usually introduces subjective factors.
Therefore, how to train an accurate behavior localization model without relying on frame-by-frame annotation is a technical problem to be solved by those skilled in the art.
Disclosure of Invention
In order to solve the above problems in the prior art, that is, the time and labor cost of training a network model with manually annotated information and the subjective factors such annotation introduces, a first aspect of the present invention provides a weakly supervised temporal behavior localization method based on a relation prototype network, the method comprising:
dividing a video to be recognized into a plurality of video segments according to a preset time interval, and inputting the optical flow image corresponding to each video segment together with the plurality of video segments into a pre-trained behavior localization model, wherein the behavior localization model is constructed based on a deep convolutional neural network and performs video behavior localization based on preset training samples;
determining a first similarity between the human behavior in each video clip and a preset target behavior through the behavior positioning model;
and determining the behavior category to which the human behavior in each video clip belongs according to the comparison result of the first similarity and a preset threshold.
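As an illustrative sketch (not part of the claims), the segmentation step can be expressed as follows; the 16-frame segment length follows the detailed steps later in the description, and the function name and the choice to drop trailing frames are assumptions:

```python
import numpy as np

def split_into_segments(frames: np.ndarray, seg_len: int = 16) -> list:
    """Split a (T, H, W, C) frame array into non-overlapping segments of
    seg_len frames each; trailing frames that do not fill a segment are
    dropped (padding would be an equally valid choice)."""
    n_segs = frames.shape[0] // seg_len
    return [frames[i * seg_len:(i + 1) * seg_len] for i in range(n_segs)]

# Example: a 100-frame video yields 6 complete 16-frame segments.
video = np.zeros((100, 224, 224, 3), dtype=np.uint8)
segments = split_into_segments(video)
```

The same splitting is applied to the RGB stream and to the optical flow stream before both are fed to the model.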
In a possible implementation manner, the behavior localization model includes multiple graph convolution layers and a pooling layer, and the step of "determining a first similarity between the human behavior in each video segment and a preset target behavior through the behavior localization model" includes:
acquiring first output features corresponding to human behaviors in a plurality of optical flow images through a pooling layer of the behavior positioning model;
based on preset category features and a preset co-occurrence matrix corresponding to a plurality of target behaviors, acquiring second output features corresponding to the behavior prototypes of the human behaviors through a graph convolution layer of the behavior localization model;
based on the first output feature and the second output feature, acquiring Euclidean distances of the first output feature and the second output feature through the behavior positioning model;
and determining a first similarity between the human behavior in each video clip and a preset target behavior according to the Euclidean distance.
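The distance-to-similarity step above can be sketched minimally as follows, assuming segment features and prototypes are plain arrays; the use of the negative Euclidean distance as similarity follows the detailed description, while the function name is hypothetical:

```python
import numpy as np

def prototype_similarity(features: np.ndarray, prototypes: np.ndarray) -> np.ndarray:
    """Similarity of each segment feature (T, D) to each class prototype
    (C, D), defined as the negative Euclidean distance: (T, C)."""
    diff = features[:, None, :] - prototypes[None, :, :]  # (T, C, D)
    dist = np.linalg.norm(diff, axis=-1)                  # (T, C)
    return -dist

feats = np.array([[0.0, 0.0], [3.0, 4.0]])
protos = np.array([[0.0, 0.0], [6.0, 8.0]])
sim = prototype_similarity(feats, protos)
# sim[0, 0] == 0.0 (identical vectors), sim[1, 1] == -5.0
```

A segment is then assigned a behavior category by comparing its similarity against the preset threshold.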
In a possible implementation manner, before the step of inputting the optical flow image corresponding to each video segment and the plurality of video segments into a pre-trained behavior localization model, the method further includes training the behavior localization model, and the method for training the behavior localization model includes:
converting the first output characteristic into a third output characteristic through a multilayer convolution network of a behavior positioning model to be trained according to the first output characteristic;
according to the third output characteristic, acquiring attention weight through a multilayer convolution network of the behavior positioning model to be trained;
carrying out weighted summation on the Euclidean distance and the attention weight to obtain a second similarity of the human behavior in the optical flow image and a preset target behavior;
according to the second similarity, acquiring the probability that the human behavior in the optical flow image belongs to a preset target behavior category through a soft-max layer of the behavior positioning model to be trained;
and training the behavior positioning model through a back propagation algorithm and a random gradient descent algorithm according to the probability that the human behavior in the optical flow image belongs to the preset target behavior category.
In a possible implementation manner, the loss function corresponding to the second similarity is:

$$\mathcal{L}_{cluster} = -\sum_{i=1}^{C} \delta(y_i = 1)\, s_i$$

where $\mathcal{L}_{cluster}$ represents the value of the loss function corresponding to the second similarity, $\delta(\cdot)$ represents a conditional function that is 1 when the condition is satisfied and 0 otherwise, $C$ is the number of behavior categories, $y_i$ represents the true probability corresponding to behavior class $i$, and $s_i$ denotes the $i$-th value of the second similarity.
In a possible implementation manner, after the step of "obtaining, according to the second similarity, a probability that the human behavior in the optical flow image belongs to a preset target behavior category through a soft-max layer of the behavior localization model to be trained", the method further includes:
obtaining the deviation of the first probability from the second probability according to the following formula:

$$\mathcal{L}_{cls} = -\sum_{i=1}^{C} \hat{y}_i \log p_i$$

where $C$ represents the number of behavior categories, $\hat{y}_i$ represents the normalized true probability corresponding to behavior class $i$, and $p_i$ represents the first probability, namely the probability that the human behavior in the optical flow image belongs to the preset target behavior category; the second probability represents the probability that the human behavior in the optical flow image actually belongs to the behavior category.
Preferably, after the step of "obtaining the first probability" and before the step of "training the behavior localization model through a back propagation algorithm and a stochastic gradient descent algorithm", the method further includes obtaining a global loss of the behavior localization model according to the following formula:

$$\mathcal{L} = \mathcal{L}_{cls} + \alpha\, \mathcal{L}_{cluster}$$

where $\mathcal{L}$ represents the global loss of the behavior localization model to be trained, $\mathcal{L}_{cls}$ represents the value of the loss function corresponding to the first probability, $\alpha$ represents an adjustment parameter used to adjust the weight ratio of the loss functions, and $\mathcal{L}_{cluster}$ represents the value of the loss function corresponding to the second similarity.
In another aspect of the present invention, a weakly supervised temporal behavior localization apparatus based on a relation prototype network is further provided, including:
the system comprises a first module, a second module and a third module, wherein the first module is used for dividing a video to be recognized into a plurality of video segments according to a preset time interval, inputting an optical flow image corresponding to each video segment and the plurality of video segments into a pre-trained behavior positioning model, the behavior positioning model is a model constructed based on a deep convolutional neural network, and video behavior positioning is carried out based on a preset training sample;
the second module is used for determining the first similarity between the human behavior in each video clip and a preset target behavior through the behavior positioning model;
and the third module is used for determining the behavior category to which the human behavior in each video clip belongs according to the comparison result of the first similarity and a preset threshold.
In one possible implementation, the behavior localization model includes a multilayer graph convolution layer and a pooling layer, and the second module is further configured to:
acquiring first output features corresponding to human behaviors in a plurality of optical flow images through a pooling layer of the behavior positioning model;
based on preset category features and a preset co-occurrence matrix corresponding to a plurality of target behaviors, acquiring second output features corresponding to the behavior prototypes of the human behaviors in the optical flow image through the graph convolution layers of the behavior localization model;
based on the first output characteristic and the second output characteristic, acquiring Euclidean distance between the first output characteristic and the second output characteristic through the behavior positioning model;
and determining a first similarity between the human behavior in each video clip and a preset target behavior according to the Euclidean distance.
In one possible implementation manner, the apparatus further includes a fourth module configured to:
converting the first output characteristic into a third output characteristic through a multilayer convolution network of a behavior positioning model to be trained according to the first output characteristic;
according to the third output characteristic, acquiring attention weight through a multilayer convolution network of the behavior positioning model to be trained;
carrying out weighted summation on the Euclidean distance and the attention weight to obtain a second similarity of the human behavior in the optical flow image and a preset target behavior;
according to the second similarity, acquiring the probability that the human behavior in the optical flow image belongs to a preset target behavior category through a soft-max layer of the behavior positioning model to be trained;
and training the behavior positioning model through a back propagation algorithm and a random gradient descent algorithm according to the probability that the human behavior in the optical flow image belongs to the preset target behavior category.
In a possible implementation manner, the loss function corresponding to the second similarity is:

$$\mathcal{L}_{cluster} = -\sum_{i=1}^{C} \delta(y_i = 1)\, s_i$$

where $\mathcal{L}_{cluster}$ represents the value of the loss function corresponding to the second similarity, $\delta(\cdot)$ represents a conditional function that is 1 when the condition is satisfied and 0 otherwise, $C$ is the number of behavior categories, $y_i$ represents the true probability corresponding to behavior class $i$, and $s_i$ denotes the $i$-th value of the second similarity.
In one possible implementation manner, the fourth module is further configured to:
obtaining the deviation of the first probability from the second probability according to the following formula:

$$\mathcal{L}_{cls} = -\sum_{i=1}^{C} \hat{y}_i \log p_i$$

where $C$ represents the number of behavior categories, $\hat{y}_i$ represents the normalized true probability corresponding to behavior class $i$, and $p_i$ represents the first probability, namely the probability that the human behavior in the optical flow image belongs to the preset target behavior category; the second probability represents the probability that the human behavior in the optical flow image actually belongs to the behavior category.
In another aspect of the present invention, a weakly supervised temporal behavior localization apparatus based on a relation prototype network is further provided, including:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to execute the relation prototype network-based weakly supervised temporal behavior localization method as described above.
Another aspect of the present invention further provides a non-transitory computer readable storage medium, on which computer program instructions are stored, and when the computer program instructions are executed by a processor, the method for weakly supervised temporal behavior localization based on a relation prototype network as described above is implemented.
The method and the device for positioning the weakly supervised time sequence behaviors based on the relation prototype network can model the relation between different behaviors, and can make the characteristics of each part of the behaviors as close as possible through clustering loss, thereby realizing the positioning of complete behavior segments.
Drawings
FIG. 1 is a schematic flow chart of a method for positioning weakly supervised temporal behavior based on a relationship prototype network according to the present invention;
fig. 2 is a schematic structural diagram of the weakly supervised time series behavior localization apparatus based on the relation prototype network of the present invention.
Detailed Description
In order to make the embodiments, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the embodiments are some, but not all embodiments of the present invention. It should be understood by those skilled in the art that these embodiments are only for explaining the technical principle of the present invention, and are not intended to limit the scope of the present invention.
Fig. 1 exemplarily shows a flowchart of the weakly supervised temporal behavior localization method based on the relationship prototype network of the present invention. As shown in fig. 1, the method includes:
step S101, dividing a video to be recognized into a plurality of video segments according to a preset time interval, and inputting a pre-trained behavior positioning model into an optical flow image corresponding to each video segment and the plurality of video segments.
In one possible implementation manner, the behavior localization model is a model constructed based on a deep convolutional neural network, and video behavior localization is performed based on preset training samples.
And S102, determining a first similarity between the human behavior in each video clip and a preset target behavior through the behavior positioning model.
Step S103, determining the behavior category to which the human behavior in each video clip belongs according to the comparison result of the first similarity and a preset threshold.
The method and the device for positioning the weakly supervised time sequence behaviors based on the relation prototype network can model the relation between different behaviors, and can make the characteristics of each part of the behaviors as close as possible through clustering loss, thereby realizing the positioning of complete behavior segments.
In a possible implementation manner, the behavior localization model includes multiple graph convolution layers and a pooling layer, and the step of "determining a first similarity between the human behavior in each video segment and a preset target behavior through the behavior localization model" includes:
acquiring first output features corresponding to human behaviors in a plurality of optical flow images through a pooling layer of the behavior positioning model;
acquiring a second output feature corresponding to the behavior prototype of the human behavior in the optical flow image through a graph convolution layer of the behavior localization model, based on preset category features and a preset co-occurrence matrix corresponding to a plurality of target behaviors;
based on the first output feature and the second output feature, acquiring Euclidean distances of the first output feature and the second output feature through the behavior positioning model;
and determining the first similarity between the human behavior in each video segment and the preset target behavior according to the Euclidean distance.
In a possible implementation manner, before the step of inputting the optical flow image corresponding to each video segment and the plurality of video segments into a pre-trained behavior localization model, the method further includes training the behavior localization model, and the method for training the behavior localization model includes:
converting the first output characteristic into a third output characteristic through a multilayer convolution network of a behavior positioning model to be trained according to the first output characteristic;
according to the third output characteristic, acquiring attention weight through a multilayer convolution network of the behavior positioning model to be trained;
carrying out weighted summation on the Euclidean distance and the attention weight to obtain a second similarity of the human behavior in the optical flow image and a preset target behavior;
according to the second similarity, acquiring the probability that the human behavior in the optical flow image belongs to a preset target behavior category through a soft-max layer of the behavior positioning model to be trained;
and training the behavior positioning model through a back propagation algorithm and a random gradient descent algorithm according to the probability that the human behavior in the optical flow image belongs to the preset target behavior category.
In a possible implementation manner, the loss function corresponding to the second similarity is:

$$\mathcal{L}_{cluster} = -\sum_{i=1}^{C} \delta(y_i = 1)\, s_i$$

where $\mathcal{L}_{cluster}$ represents the value of the loss function corresponding to the second similarity, $\delta(\cdot)$ represents a conditional function that is 1 when the condition is satisfied and 0 otherwise, $C$ is the number of behavior categories, $y_i$ represents the true probability corresponding to behavior class $i$, and $s_i$ denotes the $i$-th value of the second similarity.
In a possible implementation manner, after the step of "obtaining, according to the second similarity, a probability that the human behavior in the optical flow image belongs to a preset target behavior category through a soft-max layer of the behavior localization model to be trained", the method further includes:
obtaining the deviation of the first probability from the second probability according to the following formula:

$$\mathcal{L}_{cls} = -\sum_{i=1}^{C} \hat{y}_i \log p_i$$

where $C$ represents the number of behavior categories, $\hat{y}_i$ represents the normalized true probability corresponding to behavior class $i$, and $p_i$ represents the first probability, namely the probability that the human behavior in the optical flow image belongs to the preset target behavior category; the second probability represents the probability that the human behavior in the optical flow image actually belongs to the behavior category.
The weakly supervised temporal behavior localization method based on the relation prototype network may further include the following steps:
step S0, extracting corresponding optical flow images from 412 videos in the data set, dividing an original video and the optical flow images into a plurality of segments at intervals of 16 frames, and respectively inputting the segments into a model with the same structure (the original video is taken as an example in the following steps);
Illustratively, a behavior localization database containing 412 uncut videos is taken as an example.
Step S1, normalizing the output data of step S0 to a uniform spatial size (224×224 pixels), then feeding each input segment into a deep convolutional neural network with fixed parameters, which contains multiple three-dimensional convolution layers, and selecting the output feature X of the last average pooling layer (Avg Pool) of the convolutional neural network.
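The fixed-parameter 3-D CNN itself is omitted here; a minimal sketch of the final average-pooling step that yields the segment feature X is shown below. The toy activation size (a 2×7×7×1024 grid) is an assumption, not specified in the text:

```python
import numpy as np

def avg_pool_features(conv_out: np.ndarray) -> np.ndarray:
    """Global average pooling over the temporal and spatial axes, mimicking
    the final Avg Pool layer whose output X serves as the segment feature."""
    # conv_out: (T', H', W', D) activations from the last 3-D conv layer
    return conv_out.mean(axis=(0, 1, 2))  # -> (D,)

conv_out = np.ones((2, 7, 7, 1024), dtype=np.float32)
x = avg_pool_features(conv_out)
```

Stacking this per-segment feature over all T segments of a video gives the T×D input used in the following steps.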
Step S2, converting the feature X acquired in step S1 into a new vector expression X_e (size T×D) for each segment by using a 2-layer convolutional network (with 1×1 spatial convolution kernels).
Step S3, computing a co-occurrence matrix A of the behaviors from the training set.
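A sketch of how step S3 might be implemented, assuming the video-level weak labels are lists of class indices; the row normalization is a common choice before graph convolution, not something the text specifies:

```python
import numpy as np

def cooccurrence_matrix(video_labels: list, num_classes: int) -> np.ndarray:
    """Count how often pairs of behavior classes are labeled in the same
    video, then row-normalize so each row sums to 1 (assumed convention)."""
    A = np.zeros((num_classes, num_classes))
    for labels in video_labels:
        for i in labels:
            for j in labels:
                A[i, j] += 1
    row = A.sum(axis=1, keepdims=True)
    return np.divide(A, row, out=np.zeros_like(A), where=row > 0)

# Toy training set: three videos with video-level labels only.
A = cooccurrence_matrix([[0, 1], [0], [1, 2]], num_classes=3)
```

Diagonal entries count self-occurrences, so classes that always appear alone still get a nonzero row.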
Step S4, inputting the feature vectors of all behavior categories together with the co-occurrence matrix A obtained in S3, and obtaining a prototype P for each behavior by applying a 2-layer graph convolution.
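The 2-layer graph convolution of step S4 can be sketched as follows. The text does not give the exact propagation rule; this uses the standard GCN form P = A·ReLU(A·F·W1)·W2 with randomly initialized weights and toy sizes, all of which are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
C, D = 4, 8  # 4 behavior classes, 8-dim class features (toy sizes)

def graph_conv_prototypes(F, A, W1, W2):
    """Two graph-convolution layers propagating per-class features F (C, D)
    over the co-occurrence graph A (C, C) to produce one prototype per
    behavior class: P = A @ ReLU(A @ F @ W1) @ W2."""
    h = np.maximum(A @ F @ W1, 0.0)  # first graph conv + ReLU
    return A @ h @ W2                # second graph conv -> (C, D)

F = rng.standard_normal((C, D))      # per-class input features
A = np.full((C, C), 1.0 / C)         # toy row-normalized co-occurrence matrix
W1 = rng.standard_normal((D, D))
W2 = rng.standard_normal((D, D))
P = graph_conv_prototypes(F, A, W1, W2)
```

In training, W1 and W2 would be learned jointly with the rest of the localization model.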
Step S5, using the new vector expression X_e from S2 and the per-category prototypes P from S4, computing the Euclidean distance between the feature of each segment and each category prototype, and taking the negative of the Euclidean distance as the similarity between the feature and the prototype;
Step S6, using the new vector expression X_e obtained in S2, obtaining a temporal attention weight λ_t (one weight per segment) through 2 convolutional layers;
Step S7, using the attention weight λ_t obtained in S6 to weight and sum the per-segment similarities obtained in S5, obtaining the similarity vector of the whole video with respect to each prototype;
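Step S7 is a plain attention-weighted sum over the temporal axis; a minimal sketch with hand-picked toy values (the function name and the already-normalized weights are assumptions):

```python
import numpy as np

def video_similarity(seg_sim: np.ndarray, attn: np.ndarray) -> np.ndarray:
    """Weighted sum of per-segment similarities seg_sim (T, C) with temporal
    attention weights attn (T,), giving one video-level similarity per class."""
    return (attn[:, None] * seg_sim).sum(axis=0)  # -> (C,)

seg_sim = np.array([[1.0, 0.0],
                    [0.0, 1.0]])
attn = np.array([0.75, 0.25])  # the foreground segment gets the higher weight
s_video = video_similarity(seg_sim, attn)
# s_video == [0.75, 0.25]
```

Segments the attention deems background thus contribute little to the video-level similarity.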
Step S8, the video-level similarity vector obtained in S7 is fed into a soft-max layer (along the category dimension) to obtain the probability distribution p of the video over the categories, and its deviation from the true categories yields the classification loss $\mathcal{L}_{cls} = -\sum_{i=1}^{C} \hat{y}_i \log p_i$, where $C$ represents the number of behavior categories, $\hat{y}_i$ represents the normalized true probability for behavior class $i$, and $p_i$ represents the predicted probability for behavior class $i$.
Step S9, the video-level similarity vector obtained in S7 is further constrained with a clustering loss that makes the behavior expression as close as possible to its corresponding prototype: $\mathcal{L}_{cluster} = -\sum_{i=1}^{C} \delta(y_i = 1)\, \bar{s}_i$, where $\delta(\cdot)$ is a conditional function that is 1 when the condition is satisfied and 0 otherwise, $C$ represents the number of behavior classes, $y_i$ represents the true probability for behavior class $i$, and $\bar{s}_i$ is the $i$-th element of the similarity vector with respect to the prototypes.
Step S10, calculating the global loss $\mathcal{L} = \mathcal{L}_{cls} + \alpha\, \mathcal{L}_{cluster}$, where $\alpha$ is a hyperparameter used to adjust the weight ratio between the two losses, typically set to 0.01.
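Steps S8 through S10 can be combined into one loss function as sketched below. The cross-entropy and the α-weighted sum follow the formulas above; the exact form of the clustering term is a reconstruction consistent with the symbols defined in the text (since the similarity is a negative distance, maximizing it pulls features toward the prototypes of the labeled classes):

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

def global_loss(video_sim: np.ndarray, y_true: np.ndarray, alpha=0.01) -> float:
    """Classification loss (cross-entropy of soft-maxed similarities against
    the normalized true distribution) plus alpha times the clustering loss."""
    p = softmax(video_sim)                      # predicted class distribution
    y_hat = y_true / y_true.sum()               # normalized true probability
    l_cls = -np.sum(y_hat * np.log(p + 1e-12))  # deviation from true classes
    l_clu = -np.sum((y_true == 1) * video_sim)  # pull toward true prototypes
    return l_cls + alpha * l_clu

# One video, three classes, class 0 is the true label.
loss = global_loss(np.array([2.0, -1.0, -1.0]), np.array([1.0, 0.0, 0.0]))
```

With a confident, correct prediction the classification term is already small and the clustering term dominates only through the small α.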
Step S11, a back propagation algorithm and stochastic gradient descent are adopted to reduce the overall prediction error and train the model; the final behavior localization model is obtained through iterative training, usually about 1000 iterations over the whole data set, stopping when the loss no longer decreases.
Step S12, using the trained behavior localization model, the corresponding optical flow images are extracted from the 212 test videos and normalized to a uniform spatial size (224×224 pixels); they are fed into the trained model to obtain the similarity s of each segment in the video, and the video-level similarity is fed into a soft-max layer (along the category dimension) to obtain the class probability p of the video.
Step S13, updating the learned prototypes P with the prototype update strategy (Method 1), then recomputing the Euclidean distance between the vector expression X_e of each segment and the updated prototypes; the negative of this Euclidean distance is fed into a soft-max layer (along the time dimension) and the result is multiplied by the per-segment similarity obtained in S12 to obtain the final activation sequence.
Step S14, using the video category probability p obtained in S12, categories with probability below 0.1 are rejected, and a preset threshold (8×10⁻⁶ for the original video, 8×10⁻⁸ for the optical flow) is applied to the activation sequence obtained in S13 to segment it and locate the positions of the behavior in the video. Non-maximum suppression is then applied to the results obtained from the original video and the optical flow to remove redundant behavior localization results and obtain the final localization result.
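The thresholding-and-grouping part of step S14 can be sketched as follows (non-maximum suppression across the two streams is omitted; the function name and toy threshold are assumptions):

```python
import numpy as np

def localize(activation: np.ndarray, thresh: float) -> list:
    """Threshold a per-segment activation sequence and merge consecutive
    above-threshold segments into (start, end) behavior proposals."""
    above = activation > thresh
    proposals, start = [], None
    for t, a in enumerate(above):
        if a and start is None:
            start = t                      # a proposal begins
        elif not a and start is not None:
            proposals.append((start, t - 1))  # a proposal ends
            start = None
    if start is not None:                  # close a proposal reaching the end
        proposals.append((start, len(above) - 1))
    return proposals

act = np.array([0.1, 0.9, 0.8, 0.2, 0.7, 0.9])
# segments 1-2 and 4-5 exceed a 0.5 threshold
print(localize(act, 0.5))  # -> [(1, 2), (4, 5)]
```

Each proposal's segment indices map back to frame ranges via the 16-frame segment length from step S0.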
Referring to fig. 2, fig. 2 exemplarily shows a structural schematic diagram of the weakly supervised time series behavior localization apparatus based on the relation prototype network of the present invention. As shown in fig. 2, the weakly supervised time series behavior positioning apparatus based on the relationship prototype network provided by the present invention includes:
the system comprises a first module 1, a second module and a third module, wherein the first module is used for dividing a video to be recognized into a plurality of video segments according to a preset time interval, inputting an optical flow image corresponding to each video segment and the plurality of video segments into a pre-trained behavior positioning model, the behavior positioning model is a model constructed based on a deep convolutional neural network, and video behavior positioning is carried out based on a preset training sample;
the second module 2 is used for determining a first similarity between the human behavior in each video clip and a preset target behavior through the behavior localization model;
and a third module 3, configured to determine, according to a comparison result between the first similarity and a preset threshold, a behavior category to which the human behavior in each video segment belongs.
In a possible implementation manner, the behavior localization model includes a multilayer graph convolution layer and a pooling layer, and the second module 2 is further configured to:
acquiring first output features corresponding to human behaviors in a plurality of optical flow images through a pooling layer of the behavior positioning model;
acquiring a second output feature corresponding to the behavior prototype of the human behavior in the optical flow image through a graph convolution layer of the behavior localization model, based on preset category features and a preset co-occurrence matrix corresponding to a plurality of target behaviors;
based on the first output characteristic and the second output characteristic, acquiring Euclidean distance between the first output characteristic and the second output characteristic through the behavior positioning model;
and determining a first similarity between the human behavior in each video clip and a preset target behavior according to the Euclidean distance.
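The prototype-matching steps above can be sketched as follows. This is a minimal sketch, assuming a single graph-convolution step of the form H' = ReLU(Â·H·W) over the row-normalized co-occurrence matrix Â, and taking similarity as the negative Euclidean distance; the helper names, feature sizes, and the ReLU/normalization choices are illustrative assumptions, not details disclosed by the patent.

```python
import numpy as np

def graph_conv_prototypes(category_feats, cooccurrence, weight):
    """One graph-convolution step producing behavior prototypes:
    H' = ReLU(A_hat @ H @ W), with A_hat the row-normalized
    co-occurrence matrix of the target behavior classes (sketch)."""
    a_hat = cooccurrence / cooccurrence.sum(axis=1, keepdims=True)
    return np.maximum(a_hat @ category_feats @ weight, 0.0)

def first_similarity(clip_feat, prototypes):
    """Negative Euclidean distance between the clip's pooled feature
    (first output feature) and each prototype (second output feature)."""
    return -np.linalg.norm(prototypes - clip_feat, axis=1)

def classify(clip_feat, prototypes, threshold):
    """Assign the clip to the closest target behavior, or to background
    (None) when no similarity reaches the preset threshold."""
    sims = first_similarity(clip_feat, prototypes)
    best = int(np.argmax(sims))
    return best if sims[best] >= threshold else None

# Toy setup: 2 behavior classes with 2-dimensional category features.
H = np.array([[1.0, 0.0], [0.0, 1.0]])          # preset category features
A = np.array([[2.0, 1.0], [1.0, 2.0]])          # preset co-occurrence matrix
W = np.eye(2)                                   # learnable weight (fixed here)
protos = graph_conv_prototypes(H, A, W)         # behavior prototypes

clip = np.array([0.7, 0.3])                     # pooled clip feature
label = classify(clip, protos, threshold=-0.5)  # closest prototype: class 0
```

A clip far from every prototype falls below the threshold and is treated as background, which is how the comparison against the preset threshold in the third module yields a behavior category per segment.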
In one possible implementation manner, the apparatus further includes a fourth module configured to:
converting the first output characteristic into a third output characteristic through a multilayer convolution network of a behavior positioning model to be trained according to the first output characteristic;
according to the third output characteristic, acquiring attention weight through a multilayer convolution network of the behavior positioning model to be trained;
carrying out weighted summation on the Euclidean distance and the attention weight to obtain a second similarity of the human behavior in the optical flow image and a preset target behavior;
according to the second similarity, acquiring the probability that the human behavior in the optical flow image belongs to a preset target behavior category through a soft-max layer of the behavior positioning model to be trained;
and training the behavior positioning model through a back propagation algorithm and a stochastic gradient descent algorithm according to the probability that the human behavior in the optical flow image belongs to the preset target behavior category.
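A minimal sketch of the training-side computation described above, combining per-segment Euclidean distances with attention weights and passing the result through a soft-max layer. The normalization of the attention weights and the sign convention (similarity taken as the negative attended distance) are assumptions of this sketch, not stated by the patent.

```python
import numpy as np

def second_similarity(distances, attention):
    """Weighted sum of per-segment Euclidean distances (shape: segments x
    classes) with attention weights (shape: segments,); similarity is the
    negative attended distance (sketch)."""
    w = attention / attention.sum()          # normalize attention weights
    return -(w[:, None] * distances).sum(axis=0)

def softmax(scores):
    """Soft-max layer turning second similarities into class probabilities."""
    e = np.exp(scores - scores.max())        # subtract max for stability
    return e / e.sum()

dists = np.array([[1.0, 5.0],                # segment 0: distance to each class
                  [2.0, 6.0]])               # segment 1
attn = np.array([0.9, 0.1])                  # attention weights from the conv net
probs = softmax(second_similarity(dists, attn))
# probs[0] > probs[1]: the behavior most likely belongs to class 0
```

The resulting class probabilities would then drive the back-propagation and stochastic-gradient-descent update of the model parameters.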
In a possible implementation manner, the loss function corresponding to the second similarity includes:
wherein the left side of the formula represents the value of the loss function corresponding to the second similarity; δ(·) represents a conditional (indicator) function, which equals 1 when its condition is satisfied and 0 otherwise; y_i represents the probability corresponding to behavior class i; and s_i represents the i-th value of the second similarity.
In one possible implementation manner, the fourth module is further configured to:
obtaining the deviation of the first probability and the second probability according to the following formula:

deviation = -Σ_{i=1}^{C} ŷ_i · log(p_i)

wherein C represents the number of behavior categories, ŷ_i represents the true probability for behavior class i, and p_i represents the predicted probability for behavior class i; the first probability represents the probability that the human behavior in the optical flow image belongs to a preset target behavior category, and the second probability represents the probability that the human behavior in the optical flow image actually belongs to the behavior category.
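Given the symbol definitions above (true probability ŷ_i against predicted probability p_i over C behavior categories), the deviation takes the standard cross-entropy form; a minimal sketch, assuming that form:

```python
import math

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Deviation -sum_{i=1}^{C} y_true[i] * log(y_pred[i]) between the
    true class distribution (second probability) and the predicted one
    (first probability); eps guards against log(0)."""
    return -sum(t * math.log(p + eps) for t, p in zip(y_true, y_pred))

# Ground truth is class 1; the model predicts it with probability 0.8.
loss = cross_entropy([0.0, 1.0, 0.0], [0.1, 0.8, 0.1])
```

The loss shrinks toward zero as the predicted probability of the true class approaches 1, which is the signal the back-propagation step minimizes.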
Another aspect of the present application further provides a weakly supervised temporal behavior localization apparatus based on a relation prototype network, including: a processor; and a memory for storing processor-executable instructions; wherein the processor is configured to execute the relation prototype network-based weakly supervised temporal behavior localization method as described above.
Another aspect of the present application further provides a non-transitory computer readable storage medium, on which computer program instructions are stored, and when executed by a processor, the computer program instructions implement the weak supervised temporal behavior localization method based on a relation prototype network described above.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules or units is only one logical division, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part thereof contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and other media capable of storing program code.
In summary, the above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.
Claims (6)
1. A weak supervision time sequence behavior positioning method based on a relation prototype network is characterized by comprising the following steps:
dividing a video to be recognized into a plurality of video segments according to a preset time interval, inputting an optical flow image corresponding to each video segment and the plurality of video segments into a pre-trained behavior positioning model, wherein the behavior positioning model is a model constructed based on a deep convolutional neural network, and performing video behavior positioning based on a preset training sample;
determining a first similarity between the human behavior in each video clip and a preset target behavior through the behavior positioning model; the method comprises the following steps:
acquiring first output features corresponding to human behaviors in a plurality of optical flow images through a pooling layer of the behavior positioning model;
acquiring second output characteristics corresponding to behavior prototypes of the human behaviors in the optical flow image through graph convolution layers of the behavior positioning model, based on preset category characteristics and preset co-occurrence matrices corresponding to a plurality of target behaviors;
based on the first output feature and the second output feature, acquiring Euclidean distances of the first output feature and the second output feature through the behavior positioning model;
determining a first similarity between the human behavior in each video clip and a preset target behavior according to the Euclidean distance;
determining a behavior category to which the human behavior in each video clip belongs according to a comparison result of the first similarity and a preset threshold;
the behavior positioning method also comprises the step of training the behavior positioning model, and the training method of the pre-trained behavior positioning model comprises the following steps:
converting the first output characteristic into a third output characteristic through a multilayer convolution network of a behavior positioning model to be trained according to the first output characteristic;
according to the third output characteristic, acquiring attention weight through a multilayer convolution network of the behavior positioning model to be trained;
carrying out weighted summation on the Euclidean distance and the attention weight to obtain a second similarity of the human behavior in the optical flow image and a preset target behavior;
according to the second similarity, acquiring the probability that the human behavior in the optical flow image belongs to a preset target behavior category through a soft-max layer of the behavior positioning model to be trained;
and training the behavior positioning model through a back propagation algorithm and a stochastic gradient descent algorithm according to the probability that the human behavior in the optical flow image belongs to the preset target behavior category.
2. The method according to claim 1, wherein after the step of obtaining the probability that the human behavior in the optical flow image belongs to the preset target behavior category through the soft-max layer of the behavior localization model to be trained according to the second similarity, the method further comprises:
obtaining the deviation of the first probability and the second probability according to the following formula:

deviation = -Σ_{i=1}^{C} ŷ_i · log(p_i)

wherein C represents the number of behavior categories, ŷ_i represents the true probability for behavior class i, and p_i represents the prediction probability corresponding to behavior class i; the first probability represents the probability that the human behavior in the optical flow image belongs to the preset target behavior category, and the second probability represents the probability that the human behavior in the optical flow image actually belongs to the behavior category.
3. A weakly supervised temporal behavior localization apparatus based on a relation prototype network is characterized by comprising:
the system comprises a first module, a second module and a third module, wherein the first module is used for dividing a video to be recognized into a plurality of video segments according to a preset time interval, inputting an optical flow image corresponding to each video segment and the plurality of video segments into a pre-trained behavior positioning model, the behavior positioning model is a model constructed based on a deep convolutional neural network, and video behavior positioning is carried out based on a preset training sample;
the second module is used for determining the first similarity between the human behavior in each video clip and a preset target behavior through the behavior positioning model; the method specifically comprises the following steps:
acquiring first output features corresponding to human behaviors in a plurality of optical flow images through a pooling layer of the behavior positioning model;
acquiring a second output feature corresponding to a behavior prototype of the human behavior in the optical flow image through a graph convolution layer of the behavior positioning model, based on preset category features and a preset co-occurrence matrix corresponding to a plurality of target behaviors;
based on the first output feature and the second output feature, acquiring Euclidean distances of the first output feature and the second output feature through the behavior positioning model;
determining a first similarity between the human behavior in each video clip and a preset target behavior according to the Euclidean distance;
a third module, configured to determine, according to a comparison result between the first similarity and a preset threshold, a behavior category to which a human behavior in each video segment belongs;
the device also comprises a fourth module for converting the first output characteristic into a third output characteristic through a multilayer convolution network of a behavior positioning model to be trained according to the first output characteristic;
according to the third output characteristic, acquiring attention weight through a multilayer convolution network of the behavior positioning model to be trained;
carrying out weighted summation on the Euclidean distance and the attention weight to obtain a second similarity of the human behavior in the optical flow image and a preset target behavior;
according to the second similarity, acquiring the probability that the human behavior in the optical flow image belongs to a preset target behavior category through a soft-max layer of the behavior positioning model to be trained;
and training the behavior positioning model through a back propagation algorithm and a stochastic gradient descent algorithm according to the probability that the human behavior in the optical flow image belongs to the preset target behavior category.
4. The apparatus of claim 3, wherein the fourth module is further configured to:
obtaining the deviation of the first probability and the second probability according to the following formula:

deviation = -Σ_{i=1}^{C} ŷ_i · log(p_i)

wherein C represents the number of behavior categories, ŷ_i represents the true probability for behavior class i, and p_i represents the prediction probability corresponding to behavior class i; the first probability represents the probability that the human behavior in the optical flow image belongs to the preset target behavior category, and the second probability represents the probability that the human behavior in the optical flow image actually belongs to the behavior category.
5. A weakly supervised temporal behavior localization apparatus based on a relation prototype network, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to execute the method for weakly supervised temporal behavior localization based on a relational prototype network according to any of claims 1 or 2.
6. A non-transitory computer readable storage medium having stored thereon computer program instructions, wherein the computer program instructions, when executed by a processor, implement the relationship prototype network-based weakly supervised temporal behavior localization method of any one of claims 1 or 2.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010659078.XA CN111783713B (en) | 2020-07-09 | 2020-07-09 | Weak supervision time sequence behavior positioning method and device based on relation prototype network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111783713A CN111783713A (en) | 2020-10-16 |
CN111783713B true CN111783713B (en) | 2022-12-02 |
Family
ID=72758504
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010659078.XA Active CN111783713B (en) | 2020-07-09 | 2020-07-09 | Weak supervision time sequence behavior positioning method and device based on relation prototype network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111783713B (en) |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112487913A (en) * | 2020-11-24 | 2021-03-12 | 北京市地铁运营有限公司运营四分公司 | Labeling method and device based on neural network and electronic equipment |
CN112883868B (en) * | 2021-02-10 | 2022-07-15 | 中国科学技术大学 | Training method of weak supervision video motion positioning model based on relational modeling |
CN113408605B (en) * | 2021-06-16 | 2023-06-16 | 西安电子科技大学 | Hyperspectral image semi-supervised classification method based on small sample learning |
CN114333064B (en) * | 2021-12-31 | 2022-07-26 | 江南大学 | Small sample behavior identification method and system based on multidimensional prototype reconstruction reinforcement learning |
CN116030538B (en) * | 2023-03-30 | 2023-06-16 | 中国科学技术大学 | Weak supervision action detection method, system, equipment and storage medium |
CN116563953B (en) * | 2023-07-07 | 2023-10-20 | 中国科学技术大学 | Bottom-up weak supervision time sequence action detection method, system, equipment and medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2018019126A1 (en) * | 2016-07-29 | 2018-02-01 | 北京市商汤科技开发有限公司 | Video category identification method and device, data processing device and electronic apparatus |
CN109034062A (en) * | 2018-07-26 | 2018-12-18 | 南京邮电大学 | A kind of Weakly supervised anomaly detection method based on temporal consistency |
CN110516536A (en) * | 2019-07-12 | 2019-11-29 | 杭州电子科技大学 | A kind of Weakly supervised video behavior detection method for activating figure complementary based on timing classification |
CN111079646A (en) * | 2019-12-16 | 2020-04-28 | 中山大学 | Method and system for positioning weak surveillance video time sequence action based on deep learning |
WO2020119527A1 (en) * | 2018-12-11 | 2020-06-18 | 中国科学院深圳先进技术研究院 | Human action recognition method and apparatus, and terminal device and storage medium |
Non-Patent Citations (2)
Title |
---|
Sentiment analysis method based on a weakly supervised pre-trained CNN model; Zhang Yue et al.; Computer Engineering and Applications; 2018-07-01 (No. 13); 33-39 *
Research on key technologies of video classification and detection based on deep learning; Yang Ke; Doctoral Dissertation Full-text Database; 2019-08-07; full text *
Also Published As
Publication number | Publication date |
---|---|
CN111783713A (en) | 2020-10-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111783713B (en) | Weak supervision time sequence behavior positioning method and device based on relation prototype network | |
US11755911B2 (en) | Method and apparatus for training neural network and computer server | |
CN111523621B (en) | Image recognition method and device, computer equipment and storage medium | |
US10127477B2 (en) | Distributed event prediction and machine learning object recognition system | |
US10521734B2 (en) | Machine learning predictive labeling system | |
EP3292492B1 (en) | Predicting likelihoods of conditions being satisfied using recurrent neural networks | |
EP3570220B1 (en) | Information processing method, information processing device, and computer-readable storage medium | |
CN111052128B (en) | Descriptor learning method for detecting and locating objects in video | |
CN114358188A (en) | Feature extraction model processing method, feature extraction model processing device, sample retrieval method, sample retrieval device and computer equipment | |
US11379685B2 (en) | Machine learning classification system | |
CN112948612B (en) | Human body cover generation method and device, electronic equipment and storage medium | |
CN113065013B (en) | Image annotation model training and image annotation method, system, equipment and medium | |
CN115359074B (en) | Image segmentation and training method and device based on hyper-voxel clustering and prototype optimization | |
CN112148831B (en) | Image-text mixed retrieval method and device, storage medium and computer equipment | |
CN114298122B (en) | Data classification method, apparatus, device, storage medium and computer program product | |
CN116150698B (en) | Automatic DRG grouping method and system based on semantic information fusion | |
Jnawali et al. | Automatic classification of radiological report for intracranial hemorrhage | |
CN112749737A (en) | Image classification method and device, electronic equipment and storage medium | |
WO2023087063A1 (en) | Method and system for analysing medical images to generate a medical report | |
CN110704668B (en) | Grid-based collaborative attention VQA method and device | |
CN111814653B (en) | Method, device, equipment and storage medium for detecting abnormal behavior in video | |
CN113392867A (en) | Image identification method and device, computer equipment and storage medium | |
CN117273105A (en) | Module construction method and device for neural network model | |
CN112507912B (en) | Method and device for identifying illegal pictures | |
CN113221662B (en) | Training method and device of face recognition model, storage medium and terminal |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||