CN116469155A - Complex action recognition method and device based on learnable Markov logic network - Google Patents
Complex action recognition method and device based on learnable Markov logic network
- Publication number: CN116469155A (application CN202210027024.0A)
- Authority: CN (China)
- Prior art keywords: action, video, network, formula, probability
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06N5/041 — Abduction (inference or reasoning models in knowledge-based computing arrangements)
- G06N3/0464 — Convolutional networks [CNN, ConvNet]
- G06N3/08 — Neural network learning methods
- G06N5/04 — Inference or reasoning models
- G06N7/00 — Computing arrangements based on specific mathematical models
- G06V40/20 — Movements or behaviour, e.g. gesture recognition
- Y02T10/40 — Engine management systems
Abstract
The invention discloses a complex action recognition method and device based on a learnable Markov logic network, comprising the following steps: automatically learning a set of logic rules for each action from training data using a policy network; dividing the video to be analyzed into multiple video segments and calculating confidence scores for the <action participant, visual relationship, object> triples in each segment; inputting the logic rule set and the confidence scores of all triples in a segment into an improved Markov logic network to obtain the occurrence probability of each action in that segment; and deriving the action recognition result for the whole video from these occurrence probabilities. The method is clearly interpretable, does not depend on rules defined by domain experts, is compatible with existing models, and is efficient; it can not only identify the category of an action but also localize the action within the video clip.
Description
Technical Field
The invention belongs to the field of computer vision, and particularly relates to a complex action recognition method and device based on a learnable Markov logic network.
Background
Action recognition is a fundamental task in video understanding and has attracted considerable attention from researchers in recent years. With the rapid development of deep learning, 3D convolutional neural networks (3D CNNs) have thoroughly revolutionized this research area; backed by a variety of well-designed network architectures and learning algorithms, they have become the dominant method for video action recognition. The powerful representation capabilities of 3D CNNs allow them to capture complex semantic dependencies across video frames better than early work based on low-level features such as trajectories and keypoints.
Although these deep neural networks are widely used in video action recognition, they still suffer from some inherent drawbacks. In general, the workflow of a 3D CNN is as follows: a video clip is input and, after passing through the multi-layer network, a score is output representing the confidence of each action category. Such a black-box prediction mechanism does not explicitly provide the basis for identifying an action, such as when, where, and why the action occurs in the video. Moreover, lacking interpretability, these networks are vulnerable to attack, which greatly limits their application in real-world scenarios with strict security requirements. In recent years, more and more research effort has been devoted to exploring the interpretability of deep learning. It is therefore important to develop a highly interpretable action inference framework.
The invention builds on conclusions from cognitive science and neuroscience: people typically represent a complex event as a combination of atomic units. Related studies have also recently shown that a complex action can be decomposed into a series of spatiotemporal scene graphs depicting how a person interacts with surrounding objects over time. Take the action "person wakes up in bed" shown in FIG. 1 as an example. The person may initially lie in the bed, then wake up and sit up in the bed. This process can be represented by a change over time in the visual relationship between the person and the bed, i.e., from "person-lying on-bed" to "person-sitting on-bed". This characteristic allows a model to explicitly identify the occurrence of actions by detecting transition patterns of visual relationships in the video, significantly improving its interpretability and robustness. To achieve this, two key challenges must be addressed: (1) how to learn these visual relationship transition patterns automatically from data, rather than expending great effort to specify the rules manually; (2) how to avoid the negative effects of the noise that model-generated rules often contain, so that efficient action reasoning can still be performed.
Disclosure of Invention
To make up for the lack of interpretability in depth models and to address the two challenges above, the invention discloses a method and device for recognizing complex actions based on a learnable Markov logic network, designing a novel interpretable action reasoning framework to recognize complex actions in video. To this end, the invention uses first-order logic to model the temporal variation of complex actions over semantic states. Specifically, in each logic rule the visual relationships act as the atomic predicates. The logic rules carry rich information and can be generated automatically by a rule policy network that incrementally appends relation predicates related to an action. Since the rules are generated automatically rather than carefully defined by domain experts, they are prone to error. To solve this problem, the invention utilizes a Markov logic network (MLN), a statistical relational model combining first-order logic with probabilistic graphical models. The model associates each logic rule with a real-valued weight that measures the rule's uncertainty: the greater the weight, the more reliable the corresponding rule. In this way, formulas carrying noisy information can be assigned lower (or even negative) weights, reducing their adverse effects. Using the generated formulas and the Markov logic network, the invention performs probabilistic logic reasoning and finally determines the occurrence probability of each action.
The technical content of the invention comprises:
a complex action recognition method based on a learnable Markov logic network comprises the following steps:
automatically learning a logic rule set corresponding to each action from the training data by using a strategy network;
dividing the video to be detected into a plurality of video segments, and calculating confidence scores for the <action participant, visual relationship, object> triples in each video segment;
inputting the logic rule set and the confidence scores of all triples in a video segment into an improved Markov logic network to obtain the occurrence probability of each action in the video segment, wherein the improved Markov logic network is obtained by relaxing the logical operations between Boolean variables in the Markov logic network into functions defined on continuous variables;
and acquiring an action recognition result of the video to be detected according to the occurrence probability.
Further, the set of logic rules is obtained by:
1) At time t, calculating the embedded feature x_{t-1} of the relation predicate R_{t-1} obtained at the previous time step;
2) Inputting x_{t-1} and the hidden state h_{t-1} into a gated recurrent neural network (GRU);
3) Calculating, from the GRU output, the generation probability of the relation predicate R_t at time t;
4) Sampling a specific relation predicate R_t from that generation probability;
5) Obtaining the sampling probability of a formula f from the relation predicates R_t over time;
6) Based on the sampling probabilities, placing one or more formulas f into the formula set of the action to obtain the logic rule set corresponding to the action.
Further, the video to be detected is sliced by the following strategy:
1) Generating a sliding window having a plurality of different sizes;
2) For a sliding window with the size L, setting the sliding step length of the sliding window to be L/2;
3) Slicing the video to be detected according to the sliding step to generate video segments of length L.
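The slicing strategy above can be sketched in a few lines; the function name and the frame-index representation of a segment are illustrative assumptions, not taken from the patent:

```python
def sliding_window_segments(num_frames, window_sizes):
    """Generate (start, end) frame spans using several window sizes, each
    with stride L/2 so adjacent segments overlap by half a window."""
    segments = []
    for L in window_sizes:
        step = max(1, L // 2)
        start = 0
        while start + L <= num_frames:
            segments.append((start, start + L))
            start += step
    return segments
```

For an 8-frame video and a single window of size 4, this yields spans (0, 4), (2, 6), (4, 8), each overlapping its neighbor by 2 frames.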
Further, the <action participant, visual relationship, object> triples are obtained by:
1) Uniformly sampling M video frames from the video segment;
2) Detecting the objects o_i in the sampled frames using a Faster R-CNN detector with ResNet-101 as the backbone network;
3) Detecting, in each frame, the j-th visual relationship e_{ij} between each object o_i and every action participant p, yielding the <action participant p, visual relationship e_{ij}, object o_i> triples for the sampled frames.
Further, a confidence score is calculated by:
1) For each generated triple <action participant p, visual relationship e_{ij}, object o_i>, calculating the confidence score s_p of the action participant p, the confidence score s_{o_i} of the object o_i, and the confidence score s_{e_{ij}} of the visual relationship e_{ij};
2) Calculating the confidence score of the whole triple from s_p, s_{o_i}, and s_{e_{ij}}.
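Assuming, as is standard for scene-graph triples, that the three component scores are combined by multiplication (the patent's exact combination formula is given later in the description), the triple score is a one-liner:

```python
def triple_confidence(s_p, s_o, s_e):
    """Confidence of a <participant, relationship, object> triple as the
    product of its component detection scores (an assumed, standard form)."""
    return s_p * s_o * s_e
```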
Further, the occurrence probability of each action in a video segment is obtained by:
1) Converting each formula f in the rule set into a Horn clause according to the transformation rules of first-order logic, with the Boolean operations relaxed into functions defined on continuous variables;
2) Calculating the value of each instance of formula f_i from the Horn clause and the confidence scores;
3) Obtaining from these instance values the number n_i of instances of formula f_i that are true;
4) Calculating the occurrence probability of each action in the video segment from the numbers n_i.
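The continuous relaxation used in these steps is, per the description, Lukasiewicz logic. A minimal illustration of the relaxed operators and the value of a Horn clause ¬R_1 ∨ ... ∨ ¬R_T ∨ A (a sketch, not the patent's implementation):

```python
def luk_and(x, y):
    """Lukasiewicz conjunction on [0, 1]-valued truth scores."""
    return max(0.0, x + y - 1.0)

def luk_or(x, y):
    """Lukasiewicz disjunction on [0, 1]-valued truth scores."""
    return min(1.0, x + y)

def luk_not(x):
    """Lukasiewicz negation."""
    return 1.0 - x

def horn_clause_value(premise_scores, conclusion):
    """Value of the Horn clause ¬R_1 ∨ ... ∨ ¬R_T ∨ A, where premise_scores
    are the [0, 1] confidence scores of the relation predicate instances and
    conclusion is the 0/1 value of the action predicate A."""
    value = float(conclusion)
    for s in premise_scores:
        value = luk_or(value, luk_not(s))
    return value
```

Iterating the disjunction this way is equivalent to min(1, Σ_t (1 − s_t) + x_a), since once the clamp at 1 is hit the value stays 1.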
Further, the action recognition result for the whole video is obtained by performing a max-pooling operation over the occurrence probabilities of each action across all video segments.
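The max-pooling step above can be sketched as follows; representing per-segment results as dicts mapping action names to probabilities is an illustrative assumption:

```python
def video_level_scores(segment_probs):
    """Max-pool per-action occurrence probabilities over all segments of a
    video. segment_probs: list of {action: probability} dicts, one per segment."""
    scores = {}
    for probs in segment_probs:
        for action, p in probs.items():
            scores[action] = max(scores.get(action, 0.0), p)
    return scores
```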
Further, the improved Markov logic network and the policy network that generates rules for each action are trained by:
1) Generating a set of logic rules F_l with the rule policy network π_l, and obtaining the weights of the improved Markov logic network M_l by maximizing the log-likelihood;
2) Fixing the improved Markov logic network M_l, and updating the rule policy network parameters with a policy gradient algorithm by maximizing a reward function, yielding the rule policy network π_{l+1}, wherein the reward function is an action recognition evaluation metric;
3) When the rule policy network π_l and the improved Markov logic network M_l satisfy the set condition, obtaining the trained rule policy network and improved Markov logic network.
A storage medium having a computer program stored therein, wherein the computer program is arranged to perform the method described above when run.
An electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the method described above.
Compared with the prior art, the invention has the following advantages:
(1) Superior interpretability. Compared with the currently popular deep 3D convolutional neural networks, the proposed action inference framework is markedly interpretable, because the weighted logic rules serve as an explicit basis for identifying specific actions. In addition, by explicitly modeling the temporal evolution pattern of an action, the framework can not only identify the category of the action but also localize it within the video clip.
(2) No dependence on domain-expert definitions. The proposed rule policy network automatically learns the logic rules encoding complex actions from data, without manual definition, making the whole framework more robust. Existing works that use Markov logic networks for reasoning often rely on domain experts to carefully design the rules encoding events, which greatly limits their applicability. Mining rules automatically from data also makes the proposed inference framework applicable in big-data scenarios.
(3) Compatibility and efficiency. The method combines well with existing depth-model-based methods and can further improve action recognition performance. Moreover, the model can discover the relationship transition patterns corresponding to actions without excessive training data and still obtain good prediction results.
Drawings
FIG. 1 is an example diagram of decomposing an action into a spatiotemporal scene graph.
Fig. 2 is a calculation flow of the whole method.
FIG. 3 is a visualization of rules and corresponding weights learned by a model.
FIG. 4 is a graph of the results of a user survey of the model of the invention.
Detailed Description
The present invention will be described in further detail below with reference to examples and drawings in order to more specifically explain the technical details and advantages of the present invention.
As previously mentioned, complex actions can generally be decomposed into human-object interactions that vary over time. Based on this conclusion, the invention designs an interpretable action inference framework by modeling the evolution patterns of these visual relationships. As shown in FIG. 2, the proposed method consists of two main parts. The first is a rule policy network whose purpose is to mine an optimal set of formulas for each action, each formula explicitly representing a particular relationship transition pattern. The second is an action inference module that uses a Markov logic network to perform probabilistic logic inference over the formula set generated by the policy network, computing the probability of each action occurring. The implementation details of each module and the training algorithm of the whole framework are explained below.
1. Markov logic network
A Markov logic network (MLN) is a probabilistic graphical model that incorporates logic: it uses first-order logic formulas to define the potential functions of a conventional Markov random field. In a Markov logic network, each logic formula has an associated real-valued weight representing the formula's importance and reliability: a formula with a higher weight is more important, and the knowledge it encodes is more reliable. Essentially, the Markov logic network relaxes the hard constraints of pure first-order logic, so that formulas of low reliability, or even erroneous ones, can still be accommodated: worlds violating them are not impossible, merely less probable.
Specifically, let F = {f_i} denote a set of logical formulas, let ω_i be the weight corresponding to formula f_i, and let C be a finite set of constants. The Markov logic network M = (F, ω) is then defined as follows: every possible grounding of each atomic predicate in the formulas f_i with constants from C is a binary node of M; the node takes the value 1 if the grounded predicate is true, and 0 otherwise. Every possible grounding of each formula f_i acts as a potential function whose value is 1 if the grounded formula is true and 0 otherwise. Accordingly, there is an edge between two nodes of M if and only if their corresponding predicates appear together in at least one formula. The formula set F can thus be seen as a template for constructing the Markov logic network. By this definition, the probability corresponding to a state x can be expressed as

P(X = x) = (1/Z) exp(Σ_{i=1}^{F} ω_i n_i(x))    (1)

where n_i(x) is the number of groundings of formula f_i that are true under the assignment x, F is the number of formulas in the set, and Z is a normalization constant with value Z = Σ_{x'} exp(Σ_i ω_i n_i(x')).
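Equation (1) can be illustrated with a tiny enumeration over states; representing states as dict keys mapped to their true-grounding counts [n_1(x), ..., n_F(x)] is an illustrative assumption:

```python
import math

def mln_state_probability(weights, counts_per_state, state):
    """Probability of one state under equation (1): P(x) ∝ exp(Σ_i ω_i n_i(x)).
    counts_per_state maps each possible state to its list of true-grounding
    counts n_i(x); the normalizer Z sums the exponentiated scores over all states."""
    def score(s):
        return math.exp(sum(w * n for w, n in zip(weights, counts_per_state[s])))
    z = sum(score(s) for s in counts_per_state)  # normalization constant Z
    return score(state) / z
```

With a single formula of weight ln 3 that is true once in state "a" and never in state "b", state "a" receives probability 3/(3+1) = 0.75.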
2. Logic rule generation
Unlike methods that use manually defined logic formulas, the invention aims to generate the corresponding logic formulas for each action automatically, without relying on any human effort. Specifically, the invention models human-object interaction patterns in the form R_1 ∧ ... ∧ R_t ∧ ... ∧ R_T, where R_{1:T} are relation predicates on different frames and T is the total number of these predicates. The formula f encoding a complex action a can then be expressed as:

R_1 ∧ ... ∧ R_T ⇒ A    (2)

where A is the predicate representing action a. Given a particular action predicate A, only the left-hand side of f needs to be determined. Since the left-hand side contains only conjunctions (∧), it can be further represented as a linear sequence l_f = (R_1, ..., R_T). With this definition, generating the formula f becomes a sequential decision process: predicting an optimal sequence l_f for each action. The invention models this process with a policy network π, which approximates the probability distribution π(f|a; θ) that all possible formulas f for action a should satisfy, where θ are the parameters of the distribution. Once θ is determined, samples can be drawn from π(f|a; θ) accordingly to construct the required formula set F_a. The invention uses a gated recurrent neural network (GRU) to express this probability distribution. Specifically, the network can be expressed as:
h_t = GRU(x_t, h_{t-1})    (3)
where x_t is the embedded feature of the relation predicate R_t at step t, and h_{t-1} is the hidden state of the policy network π, which aggregates the information of all past relation predicates {R_1, ..., R_{t-1}}. At the initial step, the feature vector x_0 of the action predicate A is input to π; the generation probability of each predicate R_t is then calculated by the following equation:
p(R_t | R_1, ..., R_{t-1}, A) = softmax(W_p h_t)    (4)
where W_p is a parameter learned from the data. During training, a sequence can be sampled according to this probability to obtain a specific formula f. Thus, the probability that each formula f is sampled is:

p(f | a; θ) = Π_{t=1}^{T} p(R_t | R_1, ..., R_{t-1}, A)    (5)
After training the policy network π, the invention uses a beam search strategy to sample the k best formulas from the distribution π(f|a; θ) for each action a as the generated formula set F_a.
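The sequential generation process of equations (3) and (4), with the formula probability as the product of per-step probabilities, can be sketched as follows. Here an arbitrary callable stands in for the trained GRU policy; all names and the dict-based predicate distribution are assumptions:

```python
import math
import random

def sample_formula(step_probs, action, max_len, rng=random.random):
    """Sample one formula body R_1, ..., R_T from a step-wise policy.
    step_probs(history) stands in for the GRU: it returns a dict mapping
    each candidate relation predicate to p(R_t | R_1..R_{t-1}, A). The
    returned log-probability is the sum of per-step log-probabilities,
    i.e. the log of the product form of the sampling probability."""
    history = [action]  # the action predicate A seeds the sequence
    log_prob = 0.0
    for _ in range(max_len):
        probs = step_probs(history)
        u, acc, choice = rng(), 0.0, None
        for predicate, p in probs.items():  # inverse-CDF sampling
            acc += p
            if u <= acc:
                choice = predicate
                break
        if choice is None:  # guard against floating-point shortfall
            choice = predicate
        history.append(choice)
        log_prob += math.log(probs[choice])
    return history[1:], log_prob
```

With a deterministic rng the sampler is reproducible, which is convenient for testing the decision process independently of any trained network.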
3. Action reasoning
This section presents the detailed probabilistic reasoning process for actions. The inference module consists of three steps (see FIG. 2), described in turn below.
(Step 1) Sliding-window video segment generation. Given an untrimmed long video v, the invention first processes v with a sliding-window mechanism to generate multiple video segments. Since different actions vary widely in duration, the sliding window is set to several different sizes to generate segments of different lengths. In addition, for a sliding window of size L, the sliding step is set to L/2, so that each segment overlaps its adjacent segments by L/2 frames. The entire set of video segments generated by the sliding windows, denoted U, serves as the candidate proposals for actions possibly present in video v.
(Step 2) Scene graph prediction. For each video segment u ∈ U generated in the previous step, the invention uses a pre-trained scene graph predictor to extract high-level visual information from the video frames. Specifically, the predictor first detects all objects in each frame using a Faster R-CNN detector with ResNet-101 as the backbone network, and then predicts all possible visual relationships between these objects and the person. The generated scene graph can be represented as G = (O, E). Here O = {o_1, o_2, ...} is the set of objects the action participant p interacts with, and E = {{e_11, e_12, ...}, {e_21, e_22, ...}, ...} represents the visual relationships between person and objects, where e_{ij} denotes the j-th visual relationship between the action participant p and the i-th object o_i. Owing to the diversity of visual interactions, there may be several different types of visual relationships between each participant and an object. Notably, each triple r_{ij} = <p, e_{ij}, o_i> can be regarded as a specific instance of its corresponding relation predicate on the video segment. The confidence score s_{r_{ij}} of the instance r_{ij} is given by:

s_{r_{ij}} = s_p · s_{o_i} · s_{e_{ij}}    (6)

where s_p, s_{o_i}, and s_{e_{ij}} are the predicted confidence scores of the action participant p, the object o_i, and the relationship e_{ij} between them, as given by the scene graph predictor. Since the visual relationship between a person and an object hardly changes across several consecutive frames, generating a scene graph for every frame of segment u would be computationally redundant; therefore only M frames are uniformly sampled from each segment u ∈ U for the above prediction.
(Step 3) Action probability inference. Given the trained Markov logic network M, the probability of each action a in a video segment can be inferred. By equation (1), computing the overall probability requires determining the number n_i(x) of instances of formula f_i that are true on segment u. In the original Markov logic network, the value of a logic formula is obtained by logical operations on binary predicates, which can only take the discrete values 0 or 1. However, the relation predicate instances here take real values s_{r_{ij}} in the range [0, 1], which makes it ill-defined whether a formula instance should take the value 1 or 0. To remain compatible with the logical operations of first-order logic, the invention uses Lukasiewicz logic to relax the operations between Boolean variables into functions defined on continuous variables. The relaxed conjunction (∧), disjunction (∨), and negation (¬) are defined as: x ∧ y = max(0, x + y − 1), x ∨ y = min(1, x + y), ¬x = 1 − x. With this relaxation, n_i(x) can be computed effectively. Taking the formula of equation (2) as an example, such a formula can first be converted into a Horn clause according to the transformation rules of first-order logic:

¬R_1 ∨ ¬R_2 ∨ ... ∨ ¬R_T ∨ A    (7)

which can be seen as a disjunction over positive and negative predicates.
Then, based on the predicted scene graph on u, the value of each formula instance f_i(x) is:

f_i(x) = min(1, (1 − s_{R_1}) + (1 − s_{R_2}) + ... + (1 − s_{R_T}) + x_a)    (8)

where s_{R_t} is the confidence score of the corresponding relation predicate instance obtained by equation (6), and x_a is a binary variable with value 0 or 1 indicating whether action a has occurred. Thus, n_i(x) is obtained by summing the values f_i(x) of all instances of the formula. The probability of action a occurring on video segment u is then given by:

P(x_a = 1 | MB_x(a)) = exp(Σ_{i=1}^{F_a} ω_i n_i(x_a = 1)) / (exp(Σ_{i=1}^{F_a} ω_i n_i(x_a = 0)) + exp(Σ_{i=1}^{F_a} ω_i n_i(x_a = 1)))    (9)

where F_a is the number of formulas associated with action a, and MB_x(a), the Markov blanket of a, consists of the triple instances of all formulas in which a appears.
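The conditional probability of an action given its Markov blanket compares the weighted true-instance counts under x_a = 1 against x_a = 0, a standard MLN form. A minimal sketch (function and argument names are assumptions):

```python
import math

def action_probability(weights, n_true, n_false):
    """Conditional probability that action a occurs, given its Markov blanket:
    weights are the formula weights ω_i; n_true and n_false are the lists of
    true-instance counts n_i with x_a set to 1 and 0 respectively."""
    score1 = math.exp(sum(w * n for w, n in zip(weights, n_true)))
    score0 = math.exp(sum(w * n for w, n in zip(weights, n_false)))
    return score1 / (score0 + score1)
```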
The final prediction result of the whole video v is obtained by performing a maximum pooling operation on the set of segments U.
4. Joint training algorithm
The object of the invention is to learn the most suitable Markov logic network M from the training data. To this end, the training scheme comprises two main phases: rule exploration and weight learning. Because the rule exploration phase is discrete, the policy network π cannot be optimized directly by back-propagating a final loss function. The invention therefore proposes a joint training strategy: the rule exploration phase is optimized with a policy gradient algorithm from reinforcement learning, and the weights of the generated rules are optimized by supervised learning.
Suppose a formula f is sampled from π(f|a; θ); the rule policy network can then be trained by maximizing the expectation of the reward function:
J(θ) = E_{f~π(f|a;θ)}[H(f)]    (10)

Here, H(f) is a metric for evaluating action recognition performance, such as mAP. The gradient ∇_θ J(θ) = E_{f~π(f|a;θ)}[H(f) ∇_θ log π(f|a; θ)] can be estimated by Monte Carlo sampling:

∇_θ J(θ) ≈ (1/K) Σ_{k=1}^{K} H(f_k) ∇_θ log π(f_k|a; θ)    (11)
where K is the number of samples. In addition, the invention introduces a baseline b, the exponential moving average of the most recent values of H(f_k), and the original reward in equation (11) is replaced by H(f_k) − b. Furthermore, to encourage diversity in rule exploration, the entropy of π(f|a; θ) is added to the final loss function as a regularizer.
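One REINFORCE-style update matching the Monte Carlo estimate with baseline described above can be sketched as follows; the gradient callable and all names are illustrative stand-ins, not the patent's implementation:

```python
def reinforce_update(theta, samples, rewards, grad_log_prob, lr, baseline):
    """One policy-gradient ascent step: average (H(f_k) - b) * ∇ log π(f_k)
    over K sampled formulas and move theta in that direction.
    grad_log_prob(theta, f) stands in for the score-function gradient of
    one sampled formula; rewards are the per-sample H(f_k) values."""
    K = len(samples)
    grad = [0.0] * len(theta)
    for f, r in zip(samples, rewards):
        g = grad_log_prob(theta, f)
        for j in range(len(theta)):
            grad[j] += (r - baseline) * g[j] / K
    return [t + lr * gj for t, gj in zip(theta, grad)]
```

Subtracting the baseline b leaves the gradient estimate unbiased while reducing its variance, which is why the moving average of recent rewards is a common choice.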
The weight learning phase aims to learn appropriate weights for the generated formulas, which can be achieved by maximizing the log-likelihood:

L(ω) = Σ_{i=1}^{N} log P_ω(x_a^{(i)})    (12)

where N is the size of a batch of training data and x_a^{(i)} is a binary variable whose value is 1 if action a is present in the i-th video v_i and 0 otherwise.
The entire training process alternates between rule exploration and weight learning. First, an initialized rule policy network π generates a formula set F, and weight learning is performed; with the learned weights fixed, the action recognition accuracy is computed to estimate the gradient in equation (11) and update the parameters of π. The updated π then generates a new formula set F, and weight training is performed again. These two phases alternate several times.
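The alternating schedule can be summarized as a small skeleton; the three callables are illustrative stand-ins for the rule generation, weight learning, and policy-update components described in the text:

```python
def joint_training(generate_rules, learn_weights, update_policy, rounds):
    """Skeleton of the alternating training loop: rule exploration and
    weight learning take turns for a fixed number of rounds."""
    policy, weights = "pi_0", None
    for _ in range(rounds):
        rules = generate_rules(policy)                  # sample formula set F from π
        weights = learn_weights(rules)                  # maximize log-likelihood (12)
        policy = update_policy(policy, rules, weights)  # policy gradient step (11)
    return policy, weights
```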
5. Combination with depth model
An untrimmed video typically contains multiple actions, with potential links between them. Taking a video in Charades as an example, there may be a reasonable link between the actions "holding a broom", "putting the broom somewhere" and "tidying something on the floor": when a person tidies things on the floor, he may hold the broom and then put it back after finishing. The method provided by the invention can therefore serve as an inference layer on top of the output of a depth model, so that recognition of action categories that are hard to detect (such as tidying something on the floor) is reinforced by the predictions for actions that are easy to detect (such as holding a broom). Specifically, the framework of the present invention learns logical formulas and corresponding weights to represent the links between these actions. During inference, given the confidence scores output by the depth model, detections with high confidence are regarded as observed evidence, and probabilistic inference is performed over the remaining action categories, thereby improving detection accuracy.
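A minimal sketch of such an inference layer is shown below. The additive-logit update and the threshold rule are simplifications assumed for illustration, not the full weighted-formula inference of the Markov logic network: detections above a confidence threshold are frozen as evidence, and each weighted rule whose body is fully observed raises the logit of its head action by the rule weight.

```python
import math

def logit(p):
    p = min(max(p, 1e-6), 1.0 - 1e-6)  # clamp to avoid log(0)
    return math.log(p / (1.0 - p))

def refine_scores(scores, rules, tau=0.8):
    """scores -- depth-model confidence per action name
    rules    -- (body_actions, head_action, weight) triples
    tau      -- threshold above which a detection counts as evidence"""
    evidence = {a for a, s in scores.items() if s >= tau}
    refined = dict(scores)
    for body, head, w in rules:
        if head not in evidence and all(b in evidence for b in body):
            # push the head action's logit up by the rule weight
            refined[head] = 1.0 / (1.0 + math.exp(-(logit(scores[head]) + w)))
    return refined

# Hypothetical Charades-style example: easy detections boost a hard one.
scores = {"hold_broom": 0.9, "put_broom": 0.85, "tidy_floor": 0.3}
rules = [(("hold_broom", "put_broom"), "tidy_floor", 1.5)]
refined = refine_scores(scores, rules)  # tidy_floor rises above 0.3
```

Only low-confidence categories are revised, so confident depth-model outputs pass through unchanged.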
6. Experimental results
To demonstrate that the technology of the present invention performs better than the prior art, experiments were conducted on two representative datasets, Charades and CAD-120. The former is a large video dataset consisting of approximately 9,800 untrimmed videos, of which 7,985 are used for training and 1,863 for testing. These videos contain 157 complex daily activities across 15 different indoor scenes. On average, each video contains 6.8 different action categories, and multiple action categories typically appear in the same frame, which makes recognition very challenging. The latter is an RGB-D dataset focused on human activities of daily living; it consists of 551 video clips and 32,327 frames, covering 10 different high-level activities (e.g., having a meal, assembling objects). For Charades, the present invention computes mAP (Mean Average Precision) to evaluate detection performance over all action categories. For CAD-120, mAR (Mean Average Recall) is used to measure whether the model successfully identifies the performed actions.
Table 1 shows the action recognition results on Charades. The model of the invention reaches 38.4% mAP and exceeds powerful 3D CNN models such as I3D, 3D R-101 and Non-Local. This shows that the model of the present invention can fully exploit the interaction information between actions along the time dimension through the generated formulas and their weights. While the most advanced 3D models (e.g., X3D) achieve higher performance thanks to pre-training on large video benchmarks, the method of the invention outperforms depth models pre-trained only on ImageNet (38.4% vs. 21.0%). In addition, because of the accuracy limitations of the scene graph predictor, the invention also designs an Oracle version, which assumes that the visual relationships in all video frames are correctly predicted. As shown at the bottom of Table 1, the Oracle version achieves a significant mAP improvement (about 24%) and clearly exceeds all depth models, demonstrating the strong potential of the method. The invention also evaluates integration with the depth model SlowFast (R-50): the model of the present invention can further improve the performance of the depth model by exploiting the relationships between different actions.
Table 1: action recognition performance comparison of different methods under Charades
For the CAD-120 dataset, the present invention divides each long video sequence into short segments such that each segment contains only one action, and evaluates the average recall for each action. Although the Explainable AAR-RAR method also adopts an interpretable recognition framework, it relies on transition patterns defined by domain experts and performs action reasoning by observing specific state transitions between two adjacent frames. Compared with this method, the present invention uses logic rules learned from real data and is therefore more robust and efficient.
Table 2: comparison of motion recognition performance under CAD-120 by different methods
By identifying complex actions with interpretable logic formulas, the model of the present invention can provide convincing evidence explaining why a prediction was made. Based on the times at which this evidence appears, the present invention can also localize when an action occurs in the video. The present invention compares its results with several advanced depth models on Charades. As shown in Table 3, the model of the present invention achieves strong action localization results. It performs better than models pre-trained only on ImageNet (20.9% mAP vs. 14.2% mAP), and still obtains localization results comparable to models pre-trained on Kinetics. Although slightly weaker in mAP than those models, the action localization results of the present invention are more interpretable.
Table 3: action positioning performance comparison of different methods under Charades
To illustrate the interpretability and diversity of the generated logic rules, FIG. 3 shows formulas learned by the model and their associated weights. It can be observed from FIG. 3 that formulas with higher weights generally provide better inference evidence for actions. For example, "holding a broom → standing on the floor → looking at the floor" provides clear evidence for detecting the action "tidying something on the floor". In addition, the invention conducted a user study on interpretability. In this study, the weights of the model-generated formulas were evenly divided into three categories by magnitude, with the rules of each category labeled good, middle, and bad, respectively. Next, 20 action categories were sampled from Charades, and 1 formula was randomly extracted from each category as its representative. The 21 subjects participating in the study were asked to reorder the shuffled formulas according to their relevance to the action. The statistics of the user study are shown in FIG. 4. The results show that the learned formula weights are highly consistent with common human knowledge (e.g., 78.75% of good rules were still manually labeled as good).
The above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and those skilled in the art may modify or substitute the technical solution of the present invention without departing from the principle and scope of the present invention, and the protection scope of the present invention shall be defined by the claims.
Claims (10)
1. A complex action recognition method based on a learnable Markov logic network comprises the following steps:
automatically learning a logic rule set corresponding to each action from the training data by using a strategy network;
dividing the video to be detected into a plurality of video segments, and calculating confidence scores for the <action participant, visual relationship, object> triples in each video segment;
inputting the logic rule set and the confidence scores of all triples in a video segment into an improved Markov logic network to obtain the occurrence probability of each action in the video segment, wherein the improved Markov logic network is obtained by relaxing the logical operations between Boolean variables in the Markov logic network into functions defined on continuous variables;
and acquiring an action recognition result of the video to be detected according to the occurrence probability.
2. The method of claim 1, wherein the set of logical rules is obtained by:
1) At time step t, calculating the embedded feature x_{t-1} of the relation predicate R_{t-1} obtained at the previous time step t-1;
2) Inputting x_{t-1} and the hidden state h_{t-1} into a gated recurrent neural network (GRU);
3) Based on the output of the GRU, calculating the generation probability of the relation predicate R_t at time step t;
4) Sampling a specific relation predicate R_t according to the generation probability;
5) Obtaining the sampling probability of a formula f from the relation predicates R_t at each time step;
6) Based on the sampled probabilities, putting one or more formulas f into the formula set of the action to obtain the logic rule set corresponding to the action.
3. The method of claim 1, wherein the video to be detected is sliced by the following strategy:
1) Generating a sliding window having a plurality of different sizes;
2) For a sliding window with the size L, setting the sliding step length of the sliding window to be L/2;
3) Cutting the video to be detected according to the sliding step length to generate video segments of length L.
4. A method as claimed in claim 3, wherein the < action participant, visual relationship, object > triplet is obtained by:
1) For the video segment, uniformly sampling M video frames;
2) Detecting the objects o_i in the video frames using a Faster R-CNN detector with ResNet-101 as the backbone network;
3) Detecting, in each video frame, the j-th visual relationship e_{ij} between the object o_i and each action participant p, obtaining the <action participant p, visual relationship e_{ij}, object o_i> triples of the sampled frames.
5. The method of claim 4, wherein the confidence score is calculated by:
1) For each generated triple <action participant p, visual relationship e_{ij}, object o_i>, calculating a confidence score s_p for the action participant p, a confidence score s_{o_i} for the object o_i, and a confidence score s_{e_ij} for the visual relationship e_{ij};
2) Calculating the confidence score of the entire triple from s_p, s_{o_i} and s_{e_ij}.
6. The method of claim 1, wherein the probability of occurrence of each action in the video segment is obtained by:
1) Converting each formula f in the rule set into a Horn clause according to the functions defined on continuous variables and the transformation criteria of first-order logic;
2) Calculating the values of the instances of each formula f_i based on the Horn clauses and the confidence scores;
3) Obtaining the number n_i of true groundings of formula f_i from the values of its instances;
4) Calculating the occurrence probability of each action in the video segment based on the numbers n_i.
7. The method of claim 1, wherein the action recognition result of the whole video is obtained by performing a max pooling operation on the occurrence probability of each action in each video segment.
8. The method of claim 1, wherein the improved Markov logic network and the policy network that generates the rules for each action are trained by:
1) Generating a set of logic rules based on the rule policy network π_l, and obtaining the weights of the improved Markov logic network by maximizing the log likelihood;
2) Fixing the improved Markov logic network, and updating the parameters of the rule policy network by maximizing the reward function with a policy gradient algorithm to obtain the rule policy network π_{l+1}, wherein the reward function is an action recognition evaluation index;
3) When the rule policy network π_l and the improved Markov logic network meet the set conditions, obtaining the trained rule policy network and the trained improved Markov logic network.
9. A storage medium having a computer program stored therein, wherein the computer program is arranged to perform the method of any of claims 1-8 when run.
10. An electronic device comprising a memory, in which a computer program is stored, and a processor arranged to run the computer program to perform the method of any of claims 1-8.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210027024.0A CN116469155A (en) | 2022-01-11 | 2022-01-11 | Complex action recognition method and device based on learnable Markov logic network |
PCT/CN2022/135767 WO2023134320A1 (en) | 2022-01-11 | 2022-12-01 | Complex action recognition method and apparatus based on learnable markov logic network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116469155A true CN116469155A (en) | 2023-07-21 |
Family
ID=87175819
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210027024.0A Pending CN116469155A (en) | 2022-01-11 | 2022-01-11 | Complex action recognition method and device based on learnable Markov logic network |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN116469155A (en) |
WO (1) | WO2023134320A1 (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117434989B (en) * | 2023-12-20 | 2024-03-12 | 福建省力得自动化设备有限公司 | System and method for regulating and controlling environment in electrical cabinet |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117612072A (en) * | 2024-01-23 | 2024-02-27 | 中国科学技术大学 | Video understanding method based on dynamic space-time diagram |
CN117612072B (en) * | 2024-01-23 | 2024-04-19 | 中国科学技术大学 | Video understanding method based on dynamic space-time diagram |
Also Published As
Publication number | Publication date |
---|---|
WO2023134320A1 (en) | 2023-07-20 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |