CN111783713B - Weakly supervised temporal behavior localization method and device based on a relation prototype network - Google Patents

Weakly supervised temporal behavior localization method and device based on a relation prototype network

Info

Publication number
CN111783713B
Authority
CN
China
Prior art keywords
behavior
probability
human
optical flow
positioning model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010659078.XA
Other languages
Chinese (zh)
Other versions
CN111783713A (en)
Inventor
王亮 (Wang Liang)
黄岩 (Huang Yan)
黄林江 (Huang Linjiang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science
Priority to CN202010659078.XA
Publication of CN111783713A
Application granted
Publication of CN111783713B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 - Movements or behaviour, e.g. gesture recognition
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G06N 3/084 - Backpropagation, e.g. using gradient descent
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 - Scenes; Scene-specific elements
    • G06V 20/40 - Scenes; Scene-specific elements in video content
    • G06V 20/46 - Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Molecular Biology (AREA)
  • Multimedia (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a weakly supervised temporal behavior localization method and device based on a relation prototype network. To solve the problems of the prior art, in which training a network model on manually annotated information is time-consuming and labor-intensive and introduces subjective factors, the invention provides a weakly supervised temporal behavior localization method based on a relation prototype network, which comprises: dividing a video to be recognized into a plurality of video segments at a preset time interval, and inputting the optical flow image corresponding to each video segment, together with the video segments, into a pre-trained behavior localization model; determining, through the behavior localization model, a first similarity between the human behavior in each video segment and a preset target behavior; and determining the behavior category to which the human behavior in each video segment belongs according to the comparison of the first similarity with a preset threshold. The method can model the relations between different behaviors and, through a clustering loss, makes the features of each part of a behavior as close together as possible, thereby localizing complete behavior segments.

Description

Weakly supervised temporal behavior localization method and device based on a relation prototype network
Technical Field
The invention relates to the technical field of computer vision, and in particular to a weakly supervised temporal behavior localization method and device based on a relation prototype network.
Background
Behavior localization is an important and challenging classical computer vision task with wide applications in fields such as security monitoring, intelligent video analysis, and video retrieval.
Most existing behavior localization methods are fully supervised: they train the model with frame-by-frame annotations as supervision information. However, frame-by-frame annotation is time-consuming and labor-intensive, and manual annotation usually introduces subjective factors.
How to solve these problems of the prior art is therefore a technical problem to be addressed by those skilled in the art.
Disclosure of Invention
To solve the above problems in the prior art, namely the time and labor cost of training a network model on manually annotated information and the subjective factors such annotation introduces, a first aspect of the present invention provides a weakly supervised temporal behavior localization method based on a relation prototype network, the method comprising:
dividing a video to be recognized into a plurality of video segments according to a preset time interval, inputting an optical flow image corresponding to each video segment and the plurality of video segments into a pre-trained behavior positioning model, wherein the behavior positioning model is a model constructed based on a deep convolutional neural network, and performing video behavior positioning based on a preset training sample;
determining a first similarity between the human behavior in each video clip and a preset target behavior through the behavior positioning model;
and determining the behavior category to which the human behavior in each video clip belongs according to the comparison result of the first similarity and a preset threshold.
In a possible implementation manner, the behavior localization model includes multiple graph convolution layers and a pooling layer, and the step of "determining a first similarity between the human behavior in each video clip and a preset target behavior through the behavior localization model" includes:
acquiring first output features corresponding to human behaviors in a plurality of optical flow images through a pooling layer of the behavior positioning model;
based on preset category features and a preset co-occurrence matrix corresponding to a plurality of target behaviors, acquiring second output features corresponding to the behavior prototypes of the human behaviors through a graph convolution layer of the behavior positioning model;
based on the first output feature and the second output feature, acquiring Euclidean distances of the first output feature and the second output feature through the behavior positioning model;
and determining a first similarity between the human behavior in each video clip and a preset target behavior according to the Euclidean distance.
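By way of illustration, the following minimal Python (PyTorch) sketch shows one way such a negative-Euclidean-distance similarity could be computed; the tensor names and shapes are assumptions of this sketch, not part of the disclosure.

```python
import torch

def prototype_similarity(segment_features: torch.Tensor,
                         prototypes: torch.Tensor) -> torch.Tensor:
    """Similarity of each video segment to each behavior prototype,
    defined as the negative Euclidean distance.

    segment_features: (T, D) embedded features of the T video segments
    prototypes:       (C, D) behavior prototypes, one per behavior class
    returns:          (T, C) per-segment, per-class similarity
    """
    distances = torch.cdist(segment_features, prototypes)  # pairwise L2 distances
    return -distances  # larger similarity means smaller distance
```

With, say, T = 100 segments and C = 20 behavior classes, the function returns a 100×20 similarity map whose row-wise maxima indicate the most likely behavior of each segment.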
In a possible implementation manner, before the step of inputting the optical flow image corresponding to each video segment and the plurality of video segments into a pre-trained behavior localization model, the method further includes training the behavior localization model, and the method for training the behavior localization model includes:
converting the first output characteristic into a third output characteristic through a multilayer convolution network of a behavior positioning model to be trained according to the first output characteristic;
according to the third output characteristic, acquiring attention weight through a multilayer convolution network of the behavior positioning model to be trained;
carrying out weighted summation on the Euclidean distance and the attention weight to obtain a second similarity of the human behavior in the optical flow image and a preset target behavior;
according to the second similarity, acquiring the probability that the human behavior in the optical flow image belongs to a preset target behavior category through a soft-max layer of the behavior positioning model to be trained;
and training the behavior positioning model through a back propagation algorithm and a random gradient descent algorithm according to the probability that the human behavior in the optical flow image belongs to the preset target behavior category.
In a possible implementation manner, the loss function corresponding to the second similarity includes:
$$\mathcal{L}_{cluster} = -\sum_{i=1}^{C}\delta(y_i=1)\,\hat{s}_i$$

where $\mathcal{L}_{cluster}$ represents the value of the loss function corresponding to the second similarity, $\delta(\cdot)$ represents a conditional function that is 1 when the condition is satisfied and 0 otherwise, $y_i$ represents the probability corresponding to behavior class $i$, $\hat{s}_i$ represents the $i$-th value of the second similarity, and $C$ is the number of behavior categories.
In a possible implementation manner, after the step of "obtaining, according to the second similarity, a probability that the human behavior in the optical flow image belongs to a preset target behavior category through a soft-max layer of the behavior localization model to be trained", the method further includes:
obtaining the deviation of the first probability and the second probability according to the following formula:
$$\mathcal{L}_{cls} = -\sum_{i=1}^{C}\hat{y}_i\log(p_i)$$

where $C$ represents the number of behavior categories, $\hat{y}_i$ represents the normalized true probability corresponding to behavior class $i$, and $p_i$ represents the predicted probability corresponding to behavior class $i$; the first probability represents the probability that the human behavior in the optical flow image belongs to a preset target behavior category, and the second probability represents the probability of the behavior category to which the human behavior in the optical flow image actually belongs.
Preferably, after the step of "obtaining the first probability", and before the step of "training the behavior localization model by using a back propagation algorithm and a stochastic gradient descent algorithm", the method further includes obtaining a global loss of the behavior localization model according to a method shown by the following formula:
$$\mathcal{L} = \mathcal{L}_{cls} + \alpha\,\mathcal{L}_{cluster}$$

where $\mathcal{L}$ represents the global loss of the behavior localization model to be trained, $\mathcal{L}_{cls}$ represents the value of the loss function corresponding to the first probability, $\alpha$ represents an adjustment parameter for adjusting the weight ratio of the two loss functions, and $\mathcal{L}_{cluster}$ represents the value of the loss function corresponding to the second similarity.
In another aspect of the present invention, a weakly supervised temporal behavior localization apparatus based on a relation prototype network is further provided, comprising:
the system comprises a first module, a second module and a third module, wherein the first module is used for dividing a video to be recognized into a plurality of video segments according to a preset time interval, inputting an optical flow image corresponding to each video segment and the plurality of video segments into a pre-trained behavior positioning model, the behavior positioning model is a model constructed based on a deep convolutional neural network, and video behavior positioning is carried out based on a preset training sample;
the second module is used for determining the first similarity between the human behavior in each video clip and a preset target behavior through the behavior positioning model;
and the third module is used for determining the behavior category to which the human behavior in each video clip belongs according to the comparison result of the first similarity and a preset threshold.
In one possible implementation, the behavior localization model includes multiple graph convolution layers and a pooling layer, and the second module is further configured to:
acquiring first output features corresponding to human behaviors in a plurality of optical flow images through a pooling layer of the behavior positioning model;
based on preset category features and a preset co-occurrence matrix corresponding to a plurality of target behaviors, acquiring second output features corresponding to the behavior prototypes of the human behaviors in the optical flow image through the graph convolution layers of the behavior positioning model;
based on the first output characteristic and the second output characteristic, acquiring Euclidean distance between the first output characteristic and the second output characteristic through the behavior positioning model;
and determining a first similarity between the human behavior in each video clip and a preset target behavior according to the Euclidean distance.
In one possible implementation manner, the apparatus further includes a fourth module configured to:
converting the first output characteristic into a third output characteristic through a multilayer convolution network of a behavior positioning model to be trained according to the first output characteristic;
according to the third output characteristic, acquiring attention weight through a multilayer convolution network of the behavior positioning model to be trained;
carrying out weighted summation on the Euclidean distance and the attention weight to obtain a second similarity of the human behavior in the optical flow image and a preset target behavior;
according to the second similarity, acquiring the probability that the human behavior in the optical flow image belongs to a preset target behavior category through a soft-max layer of the behavior positioning model to be trained;
and training the behavior positioning model through a back propagation algorithm and a random gradient descent algorithm according to the probability that the human behavior in the optical flow image belongs to the preset target behavior category.
In a possible implementation manner, the loss function corresponding to the second similarity includes:
$$\mathcal{L}_{cluster} = -\sum_{i=1}^{C}\delta(y_i=1)\,\hat{s}_i$$

where $\mathcal{L}_{cluster}$ represents the value of the loss function corresponding to the second similarity, $\delta(\cdot)$ represents a conditional function that is 1 when the condition is satisfied and 0 otherwise, $y_i$ represents the probability corresponding to behavior class $i$, $\hat{s}_i$ represents the $i$-th value of the second similarity, and $C$ is the number of behavior categories.
In one possible implementation manner, the fourth module is further configured to:
obtaining the deviation of the first probability and the second probability according to the following formula:
$$\mathcal{L}_{cls} = -\sum_{i=1}^{C}\hat{y}_i\log(p_i)$$

where $C$ represents the number of behavior categories, $\hat{y}_i$ represents the true probability of behavior class $i$, and $p_i$ represents the predicted probability corresponding to behavior class $i$; the first probability represents the probability that the human behavior in the optical flow image belongs to a preset target behavior category, and the second probability represents the probability that the human behavior in the optical flow image actually belongs to the behavior category.
In another aspect of the present invention, a weakly supervised temporal behavior localization apparatus based on a relation prototype network is further provided, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to execute the relation prototype network-based weakly supervised temporal behavior localization method as described above.
Another aspect of the present invention further provides a non-transitory computer readable storage medium, on which computer program instructions are stored, and when the computer program instructions are executed by a processor, the method for weakly supervised temporal behavior localization based on a relation prototype network as described above is implemented.
The weakly supervised temporal behavior localization method and device based on a relation prototype network of the present invention can model the relations between different behaviors and, through a clustering loss, bring the features of each part of a behavior as close together as possible, thereby localizing complete behavior segments.
Drawings
FIG. 1 is a schematic flowchart of the weakly supervised temporal behavior localization method based on a relation prototype network of the present invention;
FIG. 2 is a schematic structural diagram of the weakly supervised temporal behavior localization apparatus based on a relation prototype network of the present invention.
Detailed Description
In order to make the embodiments, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the embodiments are some, but not all embodiments of the present invention. It should be understood by those skilled in the art that these embodiments are only for explaining the technical principle of the present invention, and are not intended to limit the scope of the present invention.
Fig. 1 exemplarily shows a flowchart of the weakly supervised temporal behavior localization method based on the relationship prototype network of the present invention. As shown in fig. 1, the method includes:
step S101, dividing a video to be recognized into a plurality of video segments according to a preset time interval, and inputting a pre-trained behavior positioning model into an optical flow image corresponding to each video segment and the plurality of video segments.
In one possible implementation manner, the behavior localization model is a model constructed based on a deep convolutional neural network, and video behavior localization is performed based on preset training samples.
Step S102, determining a first similarity between the human behavior in each video clip and a preset target behavior through the behavior positioning model.
Step S103, determining the behavior category to which the human behavior in each video clip belongs according to the comparison result of the first similarity and a preset threshold.
The weakly supervised temporal behavior localization method and device based on a relation prototype network of the present invention can model the relations between different behaviors and, through a clustering loss, bring the features of each part of a behavior as close together as possible, thereby localizing complete behavior segments.
In a possible implementation manner, the behavior localization model includes multiple graph convolution layers and a pooling layer, and the step of "determining a first similarity between the human behavior in each video clip and a preset target behavior through the behavior localization model" includes:
acquiring first output features corresponding to human behaviors in a plurality of optical flow images through a pooling layer of the behavior positioning model;
acquiring a second output feature corresponding to the behavior prototype of the human behavior in the optical flow image through a graph convolution layer of the behavior positioning model, based on preset category features and a preset co-occurrence matrix corresponding to a plurality of target behaviors;
based on the first output feature and the second output feature, acquiring Euclidean distances of the first output feature and the second output feature through the behavior positioning model;
and determining a first similarity between the human behavior in each video clip and a preset target behavior according to the Euclidean distance.
In a possible implementation manner, before the step of inputting the optical flow image corresponding to each video segment and the plurality of video segments into a pre-trained behavior localization model, the method further includes training the behavior localization model, and the method for training the behavior localization model includes:
converting the first output characteristic into a third output characteristic through a multilayer convolution network of a behavior positioning model to be trained according to the first output characteristic;
according to the third output characteristic, acquiring attention weight through a multilayer convolution network of the behavior positioning model to be trained;
carrying out weighted summation on the Euclidean distance and the attention weight to obtain a second similarity of the human behavior in the optical flow image and a preset target behavior;
according to the second similarity, acquiring the probability that the human behavior in the optical flow image belongs to a preset target behavior category through a soft-max layer of the behavior positioning model to be trained;
and training the behavior positioning model through a back propagation algorithm and a random gradient descent algorithm according to the probability that the human behavior in the optical flow image belongs to the preset target behavior category.
In a possible implementation manner, the loss function corresponding to the second similarity includes:
$$\mathcal{L}_{cluster} = -\sum_{i=1}^{C}\delta(y_i=1)\,\hat{s}_i$$

where $\mathcal{L}_{cluster}$ represents the value of the loss function corresponding to the second similarity, $\delta(\cdot)$ represents a conditional function that is 1 when the condition is satisfied and 0 otherwise, $y_i$ represents the probability corresponding to behavior class $i$, $\hat{s}_i$ represents the $i$-th value of the second similarity, and $C$ is the number of behavior categories.
In a possible implementation manner, after the step of "obtaining, according to the second similarity, a probability that the human behavior in the optical flow image belongs to a preset target behavior category through a soft-max layer of the behavior localization model to be trained", the method further includes:
obtaining the deviation of the first probability and the second probability according to the following formula:
$$\mathcal{L}_{cls} = -\sum_{i=1}^{C}\hat{y}_i\log(p_i)$$

where $C$ represents the number of behavior categories, $\hat{y}_i$ represents the true probability of behavior class $i$, and $p_i$ represents the predicted probability corresponding to behavior class $i$; the first probability represents the probability that the human behavior in the optical flow image belongs to a preset target behavior category, and the second probability represents the probability that the human behavior in the optical flow image actually belongs to the behavior category.
The weakly supervised temporal behavior localization method based on a relation prototype network may further comprise the following steps:
step S0, extracting corresponding optical flow images from 412 videos in the data set, dividing an original video and the optical flow images into a plurality of segments at intervals of 16 frames, and respectively inputting the segments into a model with the same structure (the original video is taken as an example in the following steps);
illustratively, take a behavior locator database as an example, which contains 412 uncut videos.
Step S1, normalizing the output data of step S0 to a uniform spatial size (224×224 pixels), then feeding each input segment into a deep convolutional neural network with fixed parameters, which contains multiple three-dimensional convolution layers, and selecting the output features X of the last average pooling layer (Avg Pool) of the convolutional neural network.
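A minimal sketch of this feature-extraction step; the patent does not name the backbone, so torchvision's r3d_18 is used here purely as a stand-in for the fixed-parameter 3D convolutional network.

```python
import torch
from torchvision.models.video import r3d_18

backbone = r3d_18(weights="DEFAULT")  # stand-in for the fixed 3D CNN
backbone.fc = torch.nn.Identity()     # keep the output of the last average pooling layer
backbone.eval()                       # parameters stay fixed
for param in backbone.parameters():
    param.requires_grad = False

def extract_features(clips: torch.Tensor) -> torch.Tensor:
    """clips: (T, 3, 16, 224, 224), i.e. T segments of 16 frames each.
    Returns X: (T, D) pooled per-segment features."""
    with torch.no_grad():
        return backbone(clips)
```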
Step S2, converting the features X acquired in step S1 into a new vector expression X_e (of size T×D) corresponding to each segment, using a 2-layer convolution network (with 1×1 convolution kernels).
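A sketch of this embedding step, assuming D = 512 and treating the 1×1 convolutions as per-segment transforms along the temporal axis (the layer widths are assumptions):

```python
import torch.nn as nn

# Two convolution layers with 1x1 kernels, mapping X (T, 512) to X_e (T, 512).
embed = nn.Sequential(
    nn.Conv1d(512, 512, kernel_size=1),
    nn.ReLU(),
    nn.Conv1d(512, 512, kernel_size=1),
)

def embed_segments(x):           # x: (T, D_in) features from step S1
    x = x.t().unsqueeze(0)       # (1, D_in, T), the layout Conv1d expects
    x_e = embed(x)               # (1, D, T)
    return x_e.squeeze(0).t()    # X_e: (T, D)
```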
Step S3, computing a co-occurrence matrix A of the behaviors from the training set.
Step S4, inputting the feature vectors of all behavior categories together with the co-occurrence matrix A obtained in S3, and obtaining the prototype P corresponding to each behavior using a 2-layer graph convolution.
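The co-occurrence statistics of S3 and the 2-layer graph convolution of S4 could be sketched as follows; the row normalization of A and the exact graph-convolution form are assumptions, since the patent only specifies "a 2-layer graph convolution".

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def co_occurrence_matrix(labels: torch.Tensor) -> torch.Tensor:
    """labels: (N, C) multi-hot video-level labels of the training set.
    A[i, j] counts how often behaviors i and j occur in the same video."""
    A = labels.t() @ labels                             # (C, C) co-occurrence counts
    return A / A.sum(dim=1, keepdim=True).clamp(min=1)  # row-normalize (assumed)

class PrototypeGCN(nn.Module):
    """2-layer graph convolution turning class features into prototypes P."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.w1 = nn.Linear(in_dim, out_dim)
        self.w2 = nn.Linear(out_dim, out_dim)

    def forward(self, class_feats: torch.Tensor, A: torch.Tensor) -> torch.Tensor:
        h = F.relu(self.w1(A @ class_feats))  # propagate along co-occurrence edges
        return self.w2(A @ h)                 # P: (C, D) behavior prototypes
```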
Step S5, using the new vector expressions X_e obtained in S2 and the prototype P of each category obtained in S4, calculating the Euclidean distance between the features of each segment and each category prototype, and taking the negative of the Euclidean distance as the similarity between the feature and the prototype;
Step S6, using the new vector expressions X_e obtained in S2, obtaining temporal attention weights λ_t (of size T×D) through 2 convolution layers;
Step S7, using the attention weights λ_t obtained in S6, performing a weighted summation over the new vector expressions X_e obtained in S2 to obtain the similarity vector $\hat{s}$ of the whole video with respect to each prototype;
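One reading of steps S5 to S7 in code, aggregating the per-segment similarities over time with the attention weights (the attention head's shape and the soft-max over time are assumptions of this sketch):

```python
import torch
import torch.nn as nn

attn = nn.Sequential(                 # two convolution layers producing attention
    nn.Conv1d(512, 256, kernel_size=1),
    nn.ReLU(),
    nn.Conv1d(256, 1, kernel_size=1),
)

def video_similarity(x_e: torch.Tensor, prototypes: torch.Tensor) -> torch.Tensor:
    """x_e: (T, D) segment expressions; prototypes: (C, D).
    Returns the video-level similarity vector s_hat of size (C,)."""
    s = -torch.cdist(x_e, prototypes)               # (T, C) per-segment similarity (S5)
    logits = attn(x_e.t().unsqueeze(0)).view(-1)    # (T,) attention logits (S6)
    lam = torch.softmax(logits, dim=0)              # temporal attention weights
    return (lam.unsqueeze(1) * s).sum(dim=0)        # weighted sum over time (S7)
```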
Step S8, feeding the similarity vector $\hat{s}$ obtained in S7 into a soft-max layer (along the category dimension) to obtain the probability distribution p of the video over the categories, and calculating its deviation from the true categories, yielding the loss
$$\mathcal{L}_{cls} = -\sum_{i=1}^{C}\hat{y}_i\log(p_i)$$
where $C$ represents the number of behavior categories, $\hat{y}_i$ represents the true probability for behavior class $i$, and $p_i$ represents the prediction probability corresponding to behavior class $i$.
Step S9, using the similarity vector $\hat{s}$ from S7, we constrain the model with a clustering loss that makes the behavior expression as close as possible to its corresponding prototype:
$$\mathcal{L}_{cluster} = -\sum_{i=1}^{C}\delta(y_i=1)\,\hat{s}_i$$
where $\delta(\cdot)$ is a conditional expression that is 1 when the condition is satisfied and 0 otherwise, $C$ represents the number of behavior classes, $y_i$ represents the true probability for behavior class $i$, and $\hat{s}_i$ is the $i$-th element of the similarity vector $\hat{s}$.
Step S10, calculating the global loss
$$\mathcal{L} = \mathcal{L}_{cls} + \alpha\,\mathcal{L}_{cluster}$$
where $\alpha$ is a hyperparameter used to adjust the weight ratio between the two losses, typically set to 0.01.
Step S11, training the model with a back propagation algorithm and stochastic gradient descent to reduce the overall prediction error, and obtaining the final behavior positioning model through multiple training iterations, usually 1000 passes over the whole data set, stopping training once the loss no longer decreases.
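A sketch of this training loop, reusing the pieces sketched in the previous steps (with gcn an instance of PrototypeGCN, and class_feats, A and train_loader hypothetical); the optimizer settings are assumptions, since the patent only specifies back-propagation and stochastic gradient descent.

```python
import torch

params = list(embed.parameters()) + list(attn.parameters()) + list(gcn.parameters())
optimizer = torch.optim.SGD(params, lr=1e-3, momentum=0.9)  # assumed settings

for epoch in range(1000):                        # "usually iterating 1000 times"
    for clips, labels in train_loader:           # hypothetical data loader
        x = extract_features(clips)              # S1: fixed 3D CNN features
        x_e = embed_segments(x)                  # S2: segment expressions X_e
        P = gcn(class_feats, A)                  # S4: behavior prototypes
        s_hat = video_similarity(x_e, P)         # S5-S7: video-level similarity
        p = torch.softmax(s_hat, dim=0)          # S8: class probabilities
        loss = global_loss(p, labels, s_hat)     # S10: global loss
        optimizer.zero_grad()
        loss.backward()                          # back-propagation
        optimizer.step()                         # stochastic gradient descent step
```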
Step S12, extracting the corresponding optical flow images for the 212 test videos, normalizing them to a uniform spatial size (224×224 pixels), and feeding them into the trained behavior positioning model to obtain the similarity s of each segment in the video and the class probability p of the video; the similarity s is fed into a soft-max layer (along the class dimension) to obtain the class-normalized similarity $\bar{s}$.
Step S13, updating the learned prototypes P using the prototype updating strategy, as shown in Method 1. Then recomputing the Euclidean distance between the vector expressions $X_e$ of the segments and the updated prototypes, feeding its negative into a soft-max layer (along the time dimension), and multiplying the result with $\bar{s}$ from S12 to obtain the localization score $\tilde{s}$.
Step S14, using the video category probability p obtained in S12 to reject categories with probability below 0.1, and applying a preset threshold ($8\times10^{-6}$ for the original video, $8\times10^{-8}$ for optical flow) to the localization scores $\tilde{s}$ obtained in S13 to segment them and locate the positions of the behaviors in the video. Non-maximum suppression is applied to the results obtained from the original video and the optical flow to remove redundant behavior localization results, yielding the final localization result.
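The thresholding and non-maximum suppression of step S14 could look like the following sketch; merging consecutive above-threshold segments into proposals and scoring a proposal by its mean segment score are assumptions of this sketch.

```python
def threshold_to_proposals(scores, thresh):
    """scores: per-segment localization scores for one class.
    Consecutive segments above the threshold are merged into one
    (start, end, score) proposal scored by its mean segment score."""
    proposals, start = [], None
    for t, s in enumerate(scores):
        if s > thresh and start is None:
            start = t
        elif s <= thresh and start is not None:
            proposals.append((start, t, sum(scores[start:t]) / (t - start)))
            start = None
    if start is not None:
        proposals.append((start, len(scores),
                          sum(scores[start:]) / (len(scores) - start)))
    return proposals

def temporal_nms(proposals, iou_thresh=0.5):
    """Greedy temporal non-maximum suppression over (start, end, score) proposals."""
    kept = []
    for p in sorted(proposals, key=lambda x: -x[2]):
        suppressed = False
        for q in kept:
            inter = max(0, min(p[1], q[1]) - max(p[0], q[0]))
            union = (p[1] - p[0]) + (q[1] - q[0]) - inter
            if union > 0 and inter / union > iou_thresh:
                suppressed = True
                break
        if not suppressed:
            kept.append(p)
    return kept
```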
Referring to fig. 2, fig. 2 exemplarily shows a schematic structural diagram of the weakly supervised temporal behavior localization apparatus based on a relation prototype network of the present invention. As shown in fig. 2, the apparatus provided by the present invention includes:
the system comprises a first module 1, a second module and a third module, wherein the first module is used for dividing a video to be recognized into a plurality of video segments according to a preset time interval, inputting an optical flow image corresponding to each video segment and the plurality of video segments into a pre-trained behavior positioning model, the behavior positioning model is a model constructed based on a deep convolutional neural network, and video behavior positioning is carried out based on a preset training sample;
the second module 2 is used for determining a first similarity between the human behavior in each video clip and a preset target behavior through the behavior localization model;
and a third module 3, configured to determine, according to a comparison result between the first similarity and a preset threshold, a behavior category to which the human behavior in each video segment belongs.
In a possible implementation manner, the behavior localization model includes multiple graph convolution layers and a pooling layer, and the second module 2 is further configured to:
acquiring first output features corresponding to human behaviors in a plurality of optical flow images through a pooling layer of the behavior positioning model;
acquiring a second output feature corresponding to the behavior prototype of the human behavior in the optical flow image through a graph convolution layer of the behavior positioning model, based on preset category features and a preset co-occurrence matrix corresponding to a plurality of target behaviors;
based on the first output characteristic and the second output characteristic, acquiring Euclidean distance between the first output characteristic and the second output characteristic through the behavior positioning model;
and determining a first similarity between the human behavior in each video clip and a preset target behavior according to the Euclidean distance.
In one possible implementation manner, the apparatus further includes a fourth module configured to:
converting the first output characteristic into a third output characteristic through a multilayer convolution network of a behavior positioning model to be trained according to the first output characteristic;
according to the third output characteristic, acquiring attention weight through a multilayer convolution network of the behavior positioning model to be trained;
carrying out weighted summation on the Euclidean distance and the attention weight to obtain a second similarity of the human behavior in the optical flow image and a preset target behavior;
according to the second similarity, acquiring the probability that the human behavior in the optical flow image belongs to a preset target behavior category through a soft-max layer of the behavior positioning model to be trained;
and training the behavior positioning model through a back propagation algorithm and a random gradient descent algorithm according to the probability that the human behavior in the optical flow image belongs to the preset target behavior category.
In a possible implementation manner, the loss function corresponding to the second similarity includes:
$$\mathcal{L}_{cluster} = -\sum_{i=1}^{C}\delta(y_i=1)\,\hat{s}_i$$

where $\mathcal{L}_{cluster}$ represents the value of the loss function corresponding to the second similarity, $\delta(\cdot)$ represents a conditional function that is 1 when the condition is satisfied and 0 otherwise, $y_i$ represents the probability corresponding to behavior class $i$, $\hat{s}_i$ represents the $i$-th value of the second similarity, and $C$ is the number of behavior categories.
In one possible implementation manner, the fourth module is further configured to:
obtaining the deviation of the first probability and the second probability according to the following formula:
$$\mathcal{L}_{cls} = -\sum_{i=1}^{C}\hat{y}_i\log(p_i)$$

where $C$ represents the number of behavior categories, $\hat{y}_i$ represents the true probability of behavior class $i$, and $p_i$ represents the predicted probability corresponding to behavior class $i$; the first probability represents the probability that the human behavior in the optical flow image belongs to a preset target behavior category, and the second probability represents the probability that the human behavior in the optical flow image actually belongs to the behavior category.
Another aspect of the present application further provides a weakly supervised temporal behavior localization apparatus based on a relation prototype network, comprising: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to execute the relation prototype network-based weakly supervised temporal behavior localization method as described above.
Another aspect of the present application further provides a non-transitory computer readable storage medium, on which computer program instructions are stored, and when executed by a processor, the computer program instructions implement the weak supervised temporal behavior localization method based on a relation prototype network described above.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules or units is only one logical division, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be substantially implemented, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program codes, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
In summary, the above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (6)

1. A weakly supervised temporal behavior localization method based on a relation prototype network, characterized by comprising the following steps:
dividing a video to be recognized into a plurality of video segments according to a preset time interval, inputting an optical flow image corresponding to each video segment and the plurality of video segments into a pre-trained behavior positioning model, wherein the behavior positioning model is a model constructed based on a deep convolutional neural network, and performing video behavior positioning based on a preset training sample;
determining a first similarity between the human behavior in each video clip and a preset target behavior through the behavior positioning model; the method comprises the following steps:
acquiring first output features corresponding to human behaviors in a plurality of optical flow images through a pooling layer of the behavior positioning model;
acquiring second output features corresponding to the behavior prototypes of the human behaviors in the optical flow image through graph convolution layers of the behavior positioning model, based on preset category features and preset co-occurrence matrices corresponding to a plurality of target behaviors;
based on the first output feature and the second output feature, acquiring Euclidean distances of the first output feature and the second output feature through the behavior positioning model;
determining a first similarity between the human behavior in each video clip and a preset target behavior according to the Euclidean distance;
determining a behavior category to which the human behavior in each video clip belongs according to a comparison result of the first similarity and a preset threshold;
the behavior positioning method also comprises the step of training the behavior positioning model, and the training method of the pre-trained behavior positioning model comprises the following steps:
converting the first output characteristic into a third output characteristic through a multilayer convolution network of a behavior positioning model to be trained according to the first output characteristic;
according to the third output characteristic, acquiring attention weight through a multilayer convolution network of the behavior positioning model to be trained;
carrying out weighted summation on the Euclidean distance and the attention weight to obtain a second similarity of the human behavior in the optical flow image and a preset target behavior;
according to the second similarity, acquiring the probability that the human behavior in the optical flow image belongs to a preset target behavior category through a soft-max layer of the behavior positioning model to be trained;
and training the behavior positioning model through a back propagation algorithm and a random gradient descent algorithm according to the probability that the human behavior in the optical flow image belongs to the preset target behavior category.
2. The method according to claim 1, wherein after the step of obtaining the probability that the human behavior in the optical flow image belongs to the preset target behavior category through the soft-max layer of the behavior localization model to be trained according to the second similarity, the method further comprises:
obtaining the deviation of the first probability and the second probability according to the following formula:
$$\mathcal{L}_{cls} = -\sum_{i=1}^{C}\hat{y}_i\log(p_i)$$

where $C$ represents the number of behavior categories, $\hat{y}_i$ represents the true probability for behavior class $i$, and $p_i$ represents the prediction probability corresponding to behavior class $i$; the first probability represents the probability that the human behavior in the optical flow image belongs to a preset target behavior category, and the second probability represents the probability that the human behavior in the optical flow image actually belongs to the behavior category.
3. A weakly supervised temporal behavior localization apparatus based on a relation prototype network is characterized by comprising:
the system comprises a first module, a second module and a third module, wherein the first module is used for dividing a video to be recognized into a plurality of video segments according to a preset time interval, inputting an optical flow image corresponding to each video segment and the plurality of video segments into a pre-trained behavior positioning model, the behavior positioning model is a model constructed based on a deep convolutional neural network, and video behavior positioning is carried out based on a preset training sample;
the second module is used for determining the first similarity between the human behavior in each video clip and a preset target behavior through the behavior positioning model; the method specifically comprises the following steps:
acquiring first output features corresponding to human behaviors in a plurality of optical flow images through a pooling layer of the behavior positioning model;
acquiring a second output feature corresponding to the behavior prototype of the human behavior in the optical flow image through a graph convolution layer of the behavior positioning model, based on preset category features and a preset co-occurrence matrix corresponding to a plurality of target behaviors;
based on the first output feature and the second output feature, acquiring Euclidean distances of the first output feature and the second output feature through the behavior positioning model;
determining a first similarity between the human behavior in each video clip and a preset target behavior according to the Euclidean distance;
a third module, configured to determine, according to a comparison result between the first similarity and a preset threshold, a behavior category to which a human behavior in each video segment belongs;
the device also comprises a fourth module for converting the first output characteristic into a third output characteristic through a multilayer convolution network of a behavior positioning model to be trained according to the first output characteristic;
according to the third output characteristic, acquiring attention weight through a multilayer convolution network of the behavior positioning model to be trained;
carrying out weighted summation on the Euclidean distance and the attention weight to obtain a second similarity of the human behavior in the optical flow image and a preset target behavior;
according to the second similarity, acquiring the probability that the human behavior in the optical flow image belongs to a preset target behavior category through a soft-max layer of the behavior positioning model to be trained;
and training the behavior positioning model through a back propagation algorithm and a random gradient descent algorithm according to the probability that the human behavior in the optical flow image belongs to the preset target behavior category.
4. The apparatus of claim 3, wherein the fourth module is further configured to:
obtaining the deviation of the first probability and the second probability according to the following formula:
$$\mathcal{L}_{cls} = -\sum_{i=1}^{C}\hat{y}_i\log(p_i)$$

where $C$ represents the number of behavior categories, $\hat{y}_i$ represents the true probability of behavior class $i$, and $p_i$ represents the predicted probability corresponding to behavior class $i$; the first probability represents the probability that the human behavior in the optical flow image belongs to a preset target behavior category, and the second probability represents the probability that the human behavior in the optical flow image actually belongs to the behavior category.
5. A weakly supervised temporal behavior localization apparatus based on a relation prototype network, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to execute the method for weakly supervised temporal behavior localization based on a relational prototype network according to any of claims 1 or 2.
6. A non-transitory computer readable storage medium having stored thereon computer program instructions, wherein the computer program instructions, when executed by a processor, implement the relationship prototype network-based weakly supervised temporal behavior localization method of any one of claims 1 or 2.
CN202010659078.XA 2020-07-09 2020-07-09 Weakly supervised temporal behavior localization method and device based on a relation prototype network Active CN111783713B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010659078.XA CN111783713B (en) 2020-07-09 2020-07-09 Weakly supervised temporal behavior localization method and device based on a relation prototype network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010659078.XA CN111783713B (en) 2020-07-09 2020-07-09 Weakly supervised temporal behavior localization method and device based on a relation prototype network

Publications (2)

Publication Number Publication Date
CN111783713A CN111783713A (en) 2020-10-16
CN111783713B true CN111783713B (en) 2022-12-02

Family

ID=72758504

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010659078.XA Active CN111783713B (en) 2020-07-09 2020-07-09 Weakly supervised temporal behavior localization method and device based on a relation prototype network

Country Status (1)

Country Link
CN (1) CN111783713B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112487913A (en) * 2020-11-24 2021-03-12 北京市地铁运营有限公司运营四分公司 Labeling method and device based on neural network and electronic equipment
CN112883868B (en) * 2021-02-10 2022-07-15 中国科学技术大学 Training method of weak supervision video motion positioning model based on relational modeling
CN113408605B (en) * 2021-06-16 2023-06-16 西安电子科技大学 Hyperspectral image semi-supervised classification method based on small sample learning
CN114333064B (en) * 2021-12-31 2022-07-26 江南大学 Small sample behavior identification method and system based on multidimensional prototype reconstruction reinforcement learning
CN116030538B (en) * 2023-03-30 2023-06-16 中国科学技术大学 Weak supervision action detection method, system, equipment and storage medium
CN116563953B (en) * 2023-07-07 2023-10-20 中国科学技术大学 Bottom-up weak supervision time sequence action detection method, system, equipment and medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018019126A1 (en) * 2016-07-29 2018-02-01 北京市商汤科技开发有限公司 Video category identification method and device, data processing device and electronic apparatus
CN109034062A (en) * 2018-07-26 2018-12-18 南京邮电大学 A kind of Weakly supervised anomaly detection method based on temporal consistency
CN110516536A (en) * 2019-07-12 2019-11-29 杭州电子科技大学 A kind of Weakly supervised video behavior detection method for activating figure complementary based on timing classification
CN111079646A (en) * 2019-12-16 2020-04-28 中山大学 Method and system for positioning weak surveillance video time sequence action based on deep learning
WO2020119527A1 (en) * 2018-12-11 2020-06-18 中国科学院深圳先进技术研究院 Human action recognition method and apparatus, and terminal device and storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018019126A1 (en) * 2016-07-29 2018-02-01 北京市商汤科技开发有限公司 Video category identification method and device, data processing device and electronic apparatus
CN109034062A (en) * 2018-07-26 2018-12-18 南京邮电大学 A kind of Weakly supervised anomaly detection method based on temporal consistency
WO2020119527A1 (en) * 2018-12-11 2020-06-18 中国科学院深圳先进技术研究院 Human action recognition method and apparatus, and terminal device and storage medium
CN110516536A (en) * 2019-07-12 2019-11-29 杭州电子科技大学 A kind of Weakly supervised video behavior detection method for activating figure complementary based on timing classification
CN111079646A (en) * 2019-12-16 2020-04-28 中山大学 Method and system for positioning weak surveillance video time sequence action based on deep learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Sentiment analysis method based on a weakly supervised pre-trained CNN model; Zhang Yue et al.; Computer Engineering and Applications; 2018-07-01 (No. 13); 33-39 *
Research on key technologies of video classification and detection based on deep learning; Yang Ke; Doctoral Dissertation Full-text Database; 2019-08-07; full text *

Also Published As

Publication number Publication date
CN111783713A (en) 2020-10-16

Similar Documents

Publication Publication Date Title
CN111783713B (en) Weakly supervised temporal behavior localization method and device based on a relation prototype network
US11755911B2 (en) Method and apparatus for training neural network and computer server
CN111523621B (en) Image recognition method and device, computer equipment and storage medium
US10127477B2 (en) Distributed event prediction and machine learning object recognition system
US10521734B2 (en) Machine learning predictive labeling system
EP3292492B1 (en) Predicting likelihoods of conditions being satisfied using recurrent neural networks
EP3570220B1 (en) Information processing method, information processing device, and computer-readable storage medium
CN111052128B (en) Descriptor learning method for detecting and locating objects in video
CN114358188A (en) Feature extraction model processing method, feature extraction model processing device, sample retrieval method, sample retrieval device and computer equipment
US11379685B2 (en) Machine learning classification system
CN112948612B (en) Human body cover generation method and device, electronic equipment and storage medium
CN113065013B (en) Image annotation model training and image annotation method, system, equipment and medium
CN115359074B (en) Image segmentation and training method and device based on hyper-voxel clustering and prototype optimization
CN112148831B (en) Image-text mixed retrieval method and device, storage medium and computer equipment
CN114298122B (en) Data classification method, apparatus, device, storage medium and computer program product
CN116150698B (en) Automatic DRG grouping method and system based on semantic information fusion
Jnawali et al. Automatic classification of radiological report for intracranial hemorrhage
CN112749737A (en) Image classification method and device, electronic equipment and storage medium
WO2023087063A1 (en) Method and system for analysing medical images to generate a medical report
CN110704668B (en) Grid-based collaborative attention VQA method and device
CN111814653B (en) Method, device, equipment and storage medium for detecting abnormal behavior in video
CN113392867A (en) Image identification method and device, computer equipment and storage medium
CN117273105A (en) Module construction method and device for neural network model
CN112507912B (en) Method and device for identifying illegal pictures
CN113221662B (en) Training method and device of face recognition model, storage medium and terminal

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant