WO2023134320A1 - Complex action recognition method and apparatus based on learnable markov logic network - Google Patents


Info

Publication number
WO2023134320A1
Authority
WO
WIPO (PCT)
Prior art keywords
action
video
network
formula
probability
Prior art date
Application number
PCT/CN2022/135767
Other languages
French (fr)
Chinese (zh)
Inventor
金阳
穆亚东
Original Assignee
Peking University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University
Publication of WO2023134320A1 publication Critical patent/WO2023134320A1/en


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/04 Inference or reasoning models
    • G06N5/041 Abduction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/04 Inference or reasoning models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00 Computing arrangements based on specific mathematical models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Human Computer Interaction (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Algebra (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Image Analysis (AREA)

Abstract

A complex action recognition method and apparatus based on a learnable Markov logic network. The method comprises: automatically learning, from training data and by using a policy network, a set of logic rules corresponding to each action; segmenting a video to be tested into a plurality of video clips, and calculating a confidence score for each <action participant, visual relationship, object> triplet in each video clip; inputting the logic rule set and the confidence scores of all triplets in a video clip into an improved Markov logic network, so as to obtain the probability of occurrence of each action in the video clip; and obtaining an action recognition result for the video according to the occurrence probabilities. The method is highly interpretable, compatible, and efficient, such that not only can the category of an action be recognized, but the position of the action within a video clip can also be localized.

Description

Complex action recognition method and device based on a learnable Markov logic network

Technical Field
The invention belongs to the field of computer vision, and in particular relates to a complex action recognition method and device based on a learnable Markov logic network.
Background Art
Action recognition is a fundamental task in video understanding that has attracted considerable attention from researchers over the past few years. Recently, with the rapid development of deep learning, 3D convolutional neural networks (3D CNNs) have revolutionized this research field. Relying on various carefully designed network architectures and learning algorithms, they have become the mainstream approach to video action recognition. Compared with earlier work based on low-level features (e.g., trajectories, keypoints), the powerful representation capability of 3D CNNs enables them to better capture complex semantic dependencies across video frames.
Although these deep neural networks have been widely applied to video action recognition, they still suffer from some inherent defects. In general, the workflow of a 3D CNN is as follows: a video clip is fed in and, after passing through multiple network layers, a score is output that represents the confidence of each action category. This black-box prediction mechanism does not explicitly provide the evidence for recognizing an action, such as when and where the action occurs in the video, or why it occurs. In addition, owing to this lack of interpretability, such deep neural networks are also vulnerable to adversarial attacks, which greatly limits their application in real-world scenarios with stringent safety requirements. In recent years, a growing body of research has been devoted to exploring the interpretability of deep learning. It is therefore particularly important to develop a highly interpretable action reasoning framework.
The present invention builds on research conclusions from cognitive science and neuroscience: people usually represent a complex event as a combination of atomic units. In addition, recent related work has shown that a complex action can be decomposed into a series of spatiotemporal scene graphs, which depict how a person interacts with surrounding objects over time. Take the action "a person wakes up in bed" shown in Figure 1 as an example. To complete this action, a person may initially lie in bed, then wake up and sit up on the bed. This process can be represented by the temporal change of the visual relationship between the person and the bed, i.e., from "person - lying on - bed" to "person - sitting on - bed". This property allows a model to explicitly recognize the occurrence of an action by detecting the transition patterns of visual relationships in the video, thereby significantly improving its interpretability and robustness. To realize this idea, the present invention must address two key challenges: (1) how to automatically learn such visual-relationship transition patterns from data, instead of spending substantial effort manually specifying these rules; and (2) the rules generated by a model often contain noisy information, so how to avoid the negative impact of this noise and perform efficient action reasoning.
Summary of the Invention
To make up for the lack of interpretability of deep models and to address the two challenges raised above, the present invention discloses a complex action recognition method and device based on a learnable Markov logic network, which recognizes complex actions in video by means of a novel, interpretable action reasoning framework. To this end, the present invention uses first-order logic to model the temporal evolution of the semantic states of a complex action. Specifically, in each logic rule the visual relationships serve as the atomic predicates. These logic rules carry rich information and can be automatically generated by a rule policy network that incrementally appends action-related relational predicates. Because the rules are generated automatically rather than carefully defined by domain experts, they are prone to errors. To solve this problem, the present invention employs a Markov logic network (MLN), a statistical relational model that combines first-order logic with probabilistic graphical models. The model associates each logic rule with a real-valued weight that quantifies the rule's uncertainty: the larger the weight, the more reliable the corresponding rule. In this way, formulas carrying noisy information can be assigned low (or even negative) weights, weakening their adverse influence. Using the generated formulas and the Markov logic network, the present invention can perform probabilistic logic reasoning and finally determine the occurrence probability of each action.
The technical content of the present invention includes:
A complex action recognition method based on a learnable Markov logic network, comprising the steps of:

using a policy network to automatically learn, from training data, the set of logic rules corresponding to each action;

dividing a video to be detected into several video clips, and calculating a confidence score for each <action participant, visual relationship, object> triplet in each video clip;

inputting the set of logic rules and the confidence scores of all triplets in a video clip into an improved Markov logic network to obtain the occurrence probability of each action in the video clip, wherein the improved Markov logic network is obtained by relaxing the operations between Boolean variables in a Markov logic network into functions defined on continuous variables;

obtaining an action recognition result for the video to be detected according to the occurrence probabilities.
Further, the set of logic rules is obtained through the following steps:

1) at time t, compute the embedding feature x_{t-1} of the relational predicate R_{t-1} obtained at the previous time step;

2) input x_{t-1} and the hidden state h_{t-1} into a gated recurrent unit (GRU) network;

3) compute the generation probability of the relational predicate R_t at time t from the output of the GRU;

4) sample a concrete relational predicate R_t according to the generation probability;

5) obtain the probability that formula f is sampled from the generation probabilities of the relational predicates R_t at each time step;

6) based on the sampling probabilities, put one or more formulas f into the formula set of the action to obtain the set of logic rules corresponding to that action.
Further, the video to be detected is segmented with the following strategy:

1) generate sliding windows of multiple different sizes;

2) for a sliding window of size L, set its sliding stride to L/2;

3) cut the video to be detected with this stride, generating video clips of length L.
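As a concrete illustration, the multi-scale sliding-window strategy above can be sketched in Python as follows; the default window sizes are hypothetical example values, since the text does not fix concrete sizes.

```python
def generate_clips(num_frames, window_sizes=(16, 32, 64)):
    """Generate candidate clip boundaries with multi-scale sliding windows.

    For a window of size L the stride is L/2 (integer division here),
    so consecutive clips of the same scale overlap by half a window.
    """
    clips = []
    for L in window_sizes:
        stride = L // 2
        start = 0
        while start + L <= num_frames:
            clips.append((start, start + L))  # [start, start + L) frame range
            start += stride
    return clips
```

For a 64-frame video with a single 32-frame window, this yields clips starting at frames 0, 16, and 32, each overlapping its neighbor by 16 frames.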
Further, the <action participant, visual relationship, object> triplets are obtained through the following steps:

1) for each video clip, uniformly sample M video frames;

2) detect the objects o_i in the video frames with a Faster R-CNN detector using ResNet-101 as the backbone network;

3) detect the j-th visual relationship e_ij between each object o_i and every action participant p in each frame, yielding the <action participant p, visual relationship e_ij, object o_i> triplets on the sampled frames.
Further, the confidence scores are calculated through the following steps:

1) for a generated triplet <action participant p, visual relationship e_ij, object o_i>, compute the confidence score s_p of the action participant p, the confidence score s_{o_i} of the object o_i, and the confidence score s_{e_ij} of the visual relationship e_ij;

2) compute the confidence score of the whole triplet from s_p, s_{o_i}, and s_{e_ij}.
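A minimal sketch of the triplet score, assuming the whole-triplet confidence is the product of the three predictor scores; the product form is an assumption here (a common choice for scene-graph predictors), as the exact combination equation appears only as an image in the source.

```python
def triplet_score(s_p, s_e, s_o):
    """Confidence of one <participant, relation, object> grounding,
    combined here as the product of the participant, relation and
    object scores (assumed combination rule)."""
    return s_p * s_e * s_o
```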
Further, the occurrence probability of each action in a video clip is obtained through the following steps:

1) convert each formula f in the rule set into a Horn clause according to the functions defined on continuous variables and the transformation rules of first-order logic;

2) compute the value of each grounding of formula f_i from the Horn clause and the confidence scores;

3) obtain from these values the number n_i of groundings of formula f_i that are true;

4) compute the occurrence probability of each action in the video clip from the numbers n_i.
Further, the action recognition result for the entire video is obtained by performing a max-pooling operation over the occurrence probabilities of each action in the individual video clips.
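The max-pooling aggregation described above can be sketched as follows, where `clip_probs` is assumed to be a list of per-clip dictionaries mapping action labels to occurrence probabilities.

```python
def video_level_scores(clip_probs):
    """Video-level prediction: max-pool each action's occurrence
    probability over all clips of the video."""
    video = {}
    for probs in clip_probs:
        for action, p in probs.items():
            video[action] = max(video.get(action, 0.0), p)
    return video
```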
Further, the improved Markov logic network and the policy network that generates the rules for each action are trained through the following steps:

1) generate a set of logic rules F_l with the rule policy network π_l, and obtain the weights of the improved Markov logic network M_{F_l} by maximizing the log-likelihood;

2) fix the weights of the improved Markov logic network M_{F_l}, and update the parameters of the rule policy network with a policy gradient algorithm by maximizing a reward function, obtaining the rule policy network π_{l+1}, wherein the reward function is an action recognition evaluation metric;

3) when the rule policy network π_l and the improved Markov logic network M_{F_l} satisfy a preset condition, the trained rule policy network and improved Markov logic network are obtained.
A storage medium storing a computer program, wherein the computer program is configured to perform the above-described method when run.
An electronic device comprising a memory and a processor, wherein a computer program is stored in the memory and the processor is configured to run the computer program so as to perform the above-described method.
Compared with the prior art, the advantages of the present invention are:
(1) Superior interpretability. Compared with the currently popular deep 3D convolutional neural networks, the action inference framework proposed by the present invention is highly interpretable, because the weighted logic rules serve as explicit evidence for recognizing a specific action. Furthermore, thanks to the explicit modeling of the temporal evolution patterns of actions, the framework can not only identify the category of an action but also localize its position within a video clip.
(2) No reliance on definitions by domain experts. The rule policy network proposed by the present invention can automatically learn the logic rules that encode complex actions from data, without manual definition, making the whole framework more robust. Existing work that uses Markov logic networks for reasoning often relies on rules for encoding events carefully designed by domain experts, which greatly limits its applicability. The ability to mine rules automatically from data also makes the reasoning framework of the present invention applicable to big-data scenarios.
(3) Compatibility and efficiency. The method combines well with existing deep-model-based methods to further improve action recognition performance. In addition, the model of the present invention can mine the relationship transition patterns corresponding to actions without requiring excessive training data, and achieves good prediction results.
Brief Description of the Drawings
Figure 1 is an example of decomposing an action into spatiotemporal scene graphs.
Figure 2 shows the computation flow of the whole method.
Figure 3 is a visualization of the rules learned by the model and their corresponding weights.
Figure 4 shows the user survey results for the model of the present invention.
Detailed Description
To describe the technical details and advantages of the present invention more concretely, the invention is further described below through embodiments and the accompanying drawings.
As mentioned above, a complex action can usually be decomposed into human-object interactions that change over time. Inspired by this conclusion, the present invention builds an interpretable action reasoning framework by modeling the evolution patterns of such visual relationships. As shown in Figure 1, the proposed method consists of two main parts. The first is the rule policy network, whose purpose is to mine the optimal formula set for each action, where each formula explicitly represents a specific relationship transition pattern. The second is the action reasoning module, which, based on the formula set generated by the policy network, performs probabilistic logic reasoning with a Markov logic network and computes the probability that each action occurs. In what follows, the implementation details of each module and the training algorithm of the whole framework are described.
1. Markov Logic Networks
A Markov logic network (MLN) is a probabilistic graphical model that incorporates logic: it uses first-order logic to define the potential functions of a conventional Markov random field. In an MLN, each logic formula has an associated real-valued weight indicating the importance and reliability of that formula. A formula with a higher weight is more important, and the knowledge it encodes is more reliable. In essence, an MLN removes the hard constraints of pure first-order logic, so that formulas of low reliability, or even erroneous ones, can still be included: worlds violating them become not impossible, merely less probable.
Specifically, let F = {f_1, ..., f_F} denote a set of logic formulas, let ω_i denote the weight associated with formula f_i, and let C be a finite set of constants. A Markov logic network M is then defined as follows: every possible grounding of every atomic predicate in the formulas f_i is a binary node of M; the node takes the value 1 if the corresponding grounded predicate is true, and 0 otherwise. Every possible grounding of each formula f_i acts as a potential function, whose value is 1 if the grounded formula is true and 0 otherwise. Consequently, an edge exists between two nodes of M if and only if their corresponding grounded predicates appear together in some formula. The formula set F can thus be viewed as a template for constructing the Markov logic network. Under this definition, the probability of a state x can be expressed as

P(X = x) = (1/Z) exp( Σ_{i=1}^{F} ω_i n_i(x) )   (1)

where n_i(x) is the number of groundings of formula f_i that are true under assignment x, F is the size of the formula set, and Z is a normalization constant given by Z = Σ_{x'} exp( Σ_{i=1}^{F} ω_i n_i(x') ).
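The MLN state probability of formula (1) can be illustrated with a toy enumeration over a small set of worlds. Here `counts_per_world[x][i]` plays the role of n_i(x); the worlds, counts, and weights below are made up for illustration.

```python
import math

def mln_probabilities(counts_per_world, weights):
    """Probability of each world x under formula (1):
    P(X = x) = exp(sum_i w_i * n_i(x)) / Z,
    where Z sums the exponentiated scores over all worlds."""
    scores = [math.exp(sum(w * n for w, n in zip(weights, counts)))
              for counts in counts_per_world]
    Z = sum(scores)
    return [s / Z for s in scores]
```

With two worlds, one true formula grounding in each, and weights (ln 3, 0), the first world receives probability 3/(3+1) = 0.75, showing how a higher weight makes worlds satisfying that formula more probable.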
2. Logic Rule Generation
Unlike methods that use human-defined logic formulas, the goal of the present invention is to automatically generate the corresponding logic formulas for each action without relying on any human effort. Specifically, the human-object interaction patterns modeled by the present invention take the form R_1 ∧ … ∧ R_t ∧ … ∧ R_T, where R_{1:T} are relational predicates on different frames and T is their total number. A formula f encoding a complex action a can then be expressed as

R_1 ∧ … ∧ R_T ⇒ A   (2)

where A is the predicate representation of action a. Hence, given a specific action predicate A, only the left-hand side of f needs to be determined. Since the left-hand side of the above formula contains only conjunction operations (∧), it can further be represented as a linear sequence l_f = (R_1, …, R_T). With these definitions, generating the formula f becomes a sequential decision process: predicting the most suitable sequence l_f for each action. To this end, the present invention models this process with a policy network π. The policy network approximately estimates the probability distribution π(f | a; θ) over all possible formulas f for action a, where θ are the parameters of the distribution. Once θ is determined, samples can be drawn from π(f | a; θ) to form the required formula set F_a. The present invention uses a gated recurrent unit (GRU) network to express this distribution. Concretely, the network can be written as:
h_t = GRU(x_t, h_{t-1})   (3)
where x_t is the embedding feature of the relational predicate R_t at time step t, and h_{t-1} is the hidden state of the policy network π, which fuses the information of all past relational predicates {R_1, …, R_{t-1}}. In the initial step, the feature vector x_0 of the action predicate A is fed into π, after which the generation probability of each predicate R_t is computed by:
p(R t|R 1,…,R t-1,A)=softmax(W ph t)#(4) p(R t |R 1 ,...,R t-1 ,A)=softmax(W p h t )#(4)
where W_p is a parameter learned from the data. During training, a sequence l_f = (R_1, …, R_T) can be sampled according to the above probabilities to obtain a concrete formula f. The probability that each formula f is sampled is therefore

p(f | a; θ) = Π_{t=1}^{T} p(R_t | R_1, …, R_{t-1}, A)   (5)

After the policy network π has been trained, a beam search strategy is used to sample, for each action a, the k best formulas from the distribution π(f | a; θ) as the generated formula set F_a.
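The formula-sampling probability, i.e. the product of the per-step predicate probabilities produced by the softmax of equation (4), can be sketched as below. The per-step logits stand in for W_p h_t; the GRU of equation (3) that would produce h_t is not implemented here, so the logits are hypothetical inputs.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]

def formula_probability(step_logits, predicate_ids):
    """Probability of sampling the predicate sequence R_1, ..., R_T:
    the product over time steps of p(R_t | R_1, ..., R_{t-1}, A),
    each given by a softmax over that step's logits."""
    p = 1.0
    for logits, r in zip(step_logits, predicate_ids):
        p *= softmax(logits)[r]
    return p
```

With two predicates and two steps, uniform logits at step 1 give probability 0.5, and logits (ln 3, 0) at step 2 give probability 0.75 for the first predicate, so the sequence (0, 0) has probability 0.375.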
3. Action Reasoning
This section describes the detailed probabilistic reasoning process for actions. The reasoning module comprises three steps (see Figure 1), which are introduced in turn below.
(Step 1) Sliding-window-based video clip generation. Given a long, untrimmed video v, the present invention first processes v with a sliding-window mechanism to generate multiple video clips. Since different kinds of actions often vary greatly in temporal extent, sliding windows of several different sizes are used to generate clips of different lengths. Moreover, for a sliding window of size L, the sliding stride is set to L/2, so that each clip overlaps its neighboring clips by L/2 frames. The set of all clips generated by the sliding windows is denoted U and serves as the candidate proposals for actions that may occur in video v.
(Step 2) Scene graph prediction. For each video segment u ∈ U generated in the previous step, the present invention uses a pre-trained scene graph predictor to extract high-level visual information from the video frames. Specifically, the predictor first detects all objects in each frame with a Faster R-CNN detector that uses ResNet-101 as its backbone network, and then predicts all possible visual relations between these objects and the people in the frame. The resulting scene graph is denoted G = (O, E), where O = {o_1, o_2, …} is the set of objects the action participant p interacts with, and E = {{e_11, e_12, …}, {e_21, e_22, …}} is the set of person-object visual relations, with e_ij denoting the j-th visual relation between the action participant p and the i-th object o_i. Because visual interactions are diverse, several different types of visual relations may hold between each participant and object. Notably, each triple r_ij = <p, e_ij, o_i> can be regarded as one concrete instance of its corresponding relational predicate on the video segment. The confidence score s_{r_ij} of the instance r_ij is given by:
s_{r_ij} = s_p · s_{o_i} · s_{e_ij}  (6)
Here s_p, s_{o_i}, and s_{e_ij} are the confidence scores of the predicted action performer p, the object o_i, and the relation e_ij between them, respectively, all given by the scene graph predictor. Since the visual relation between a person and an object hardly changes over a few consecutive frames, generating a scene graph for every frame of segment u would be computationally redundant. Therefore only M frames are uniformly sampled from each segment u ∈ U for the prediction above.
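Assuming Equation (6) combines the three detector scores multiplicatively (a common choice in scene-graph pipelines, taken here as an assumption rather than read from the placeholder image), the triple confidence can be computed as:

```python
def triple_confidence(s_p, s_o, s_e):
    """Confidence of triple <p, e_ij, o_i> as the product of the
    participant, object, and relation scores from the predictor."""
    return s_p * s_o * s_e
```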
(Step 3) Action probability inference. Given a trained Markov logic network, the probability of each action a in a video segment can be inferred. According to Equation (1), computing this probability requires determining n_i(x), the number of instances of formula f_i that are true on segment u. In the original Markov logic network, the value of a logic formula is obtained by logical operations over binary predicates, which can only take the discrete values 0 or 1. In the present invention, however, each relational predicate instance takes the real-valued confidence score s_{r_ij} from the formula above, whose range is [0, 1]. This property makes it difficult to decide whether a formula instance should evaluate to 1 or 0. To remain compatible with the logical operations (∧, ∨, ¬) of first-order logic, the present invention uses Łukasiewicz logic to relax the operations between Boolean variables into functions defined on continuous variables. The relaxed conjunction (∧), disjunction (∨), and negation (¬) are defined as: X ∧ Y = max(0, X + Y − 1), X ∨ Y = min(1, X + Y), ¬X = 1 − X. With this relaxation, n_i(x) can be computed efficiently. Taking the formula on the left side of Equation (2) as an example, by the transformation rules of first-order logic such a formula can first be converted into a Horn clause:
¬r_1 ∨ ¬r_2 ∨ … ∨ ¬r_n ∨ a  (7)

which can be viewed as a disjunction of positive or negated predicates.
Then, based on the scene graph predicted on u, the value of each formula instance f_i(x) is:
f_i(x) = min(1, Σ_{r_ij ∈ f_i} (1 − s_{r_ij}) + x_a)  (8)
Here s_{r_ij} is the confidence score obtained from Equation (6), and x_a is a binary variable taking the value 0 or 1 to indicate whether action a occurs. Consequently, n_i(x) is obtained by summing the values f_i(x) over all instances of the formula. The probability that action a occurs in video segment u is then given by:
P(x_a = 1 | MB_x(a)) = exp(Σ_{i=1}^{F_a} w_i n_i(x_{a=1})) / (exp(Σ_{i=1}^{F_a} w_i n_i(x_{a=0})) + exp(Σ_{i=1}^{F_a} w_i n_i(x_{a=1})))  (9)
where F_a is the number of formulas associated with action a, and MB_x(a) denotes the Markov blanket of a, i.e., the triples that co-occur with a in any formula.
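As an illustration, the Łukasiewicz relaxation and the resulting inference in Step 3 can be sketched in Python. `horn_value` soft-evaluates a Horn clause whose body predicates carry confidence scores, and `action_probability` applies the standard Markov-logic conditional over an action's Markov blanket; the helper names and data layout are ours, not from the document:

```python
import math

# Łukasiewicz relaxation of Boolean operations on [0, 1] values.
def l_and(x, y):
    return max(0.0, x + y - 1.0)

def l_or(x, y):
    return min(1.0, x + y)

def l_not(x):
    return 1.0 - x

def horn_value(body_scores, x_a):
    """Soft truth value of the Horn clause ¬r_1 ∨ … ∨ ¬r_n ∨ a,
    computed with the relaxed disjunction and negation."""
    v = float(x_a)
    for s in body_scores:
        v = l_or(v, l_not(s))
    return v

def action_probability(weights, n_true, n_false):
    """P(x_a = 1 | Markov blanket), with n_true / n_false the
    per-formula counts n_i(x) under x_a = 1 and x_a = 0."""
    s1 = sum(w * n for w, n in zip(weights, n_true))
    s0 = sum(w * n for w, n in zip(weights, n_false))
    return math.exp(s1) / (math.exp(s0) + math.exp(s1))
```

Note that a fully confident clause body with the action absent (`horn_value([1.0, 1.0], 0)`) yields 0, i.e., the formula is violated, which is the behavior the relaxation is meant to preserve.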
The final prediction for the entire video v is obtained by applying a max-pooling operation over the segment set U.
4. Joint training algorithm
The goal of the present invention is to learn the most suitable Markov logic network from the training data. To this end, the designed training scheme consists of two main stages: rule exploration and weight learning. Because the rule exploration stage is discrete, the policy network π cannot be optimized directly by backpropagating the final loss. The present invention therefore proposes a joint training strategy: the rule exploration stage is optimized with the policy gradient algorithm from reinforcement learning, while the corresponding weights of the generated rules are optimized by supervised learning.
Suppose a formula f is sampled from π(f | a; θ). The rule policy network can then be trained by maximizing the expected reward:
J(θ) = E_{f∼π(f|a;θ)}[H(f)]  (10)
Here H(f) is a metric that evaluates action recognition performance, such as mAP. The gradient ∇_θ J(θ) is:

∇_θ J(θ) = E_{f∼π(f|a;θ)}[H(f) ∇_θ log π(f|a;θ)]  (11)

which can be estimated by Monte Carlo sampling:
∇_θ J(θ) ≈ (1/K) Σ_{k=1}^{K} H(f_k) ∇_θ log π(f_k | a; θ)  (12)
Here K is the number of samples. In addition, the present invention introduces a baseline b, an exponential moving average of the recent values of H(f_k); the raw reward in Equation (11) is then replaced by H(f_k) − b. To encourage diversity in rule exploration, an entropy regularization term on π(f | a; θ) is also added to the final loss function.
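A minimal sketch of this policy-gradient stage for a categorical rule policy, with the baseline-centered reward and an entropy bonus as described; all function names, the toy parameterization, and the hyperparameters are illustrative assumptions:

```python
import math
import random

def softmax(theta):
    m = max(theta)
    e = [math.exp(t - m) for t in theta]
    z = sum(e)
    return [x / z for x in e]

def reinforce_step(theta, reward_fn, K=32, baseline=0.0, lr=0.1, ent=0.01, rng=random):
    """One REINFORCE update for a softmax policy over formula choices.
    grad of log pi(k) wrt theta_j is (1[j == k] - pi_j); the reward is
    centered by the baseline, and an entropy bonus encourages diversity."""
    pi = softmax(theta)
    grad = [0.0] * len(theta)
    for _ in range(K):
        k = rng.choices(range(len(theta)), weights=pi)[0]
        adv = reward_fn(k) - baseline
        for j in range(len(theta)):
            grad[j] += adv * ((1.0 if j == k else 0.0) - pi[j]) / K
    # Gradient of the policy entropy H = -sum pi log pi for a softmax.
    for j in range(len(theta)):
        grad[j] += ent * (-pi[j] * (math.log(pi[j]) + 1.0)
                          + pi[j] * sum(p * (math.log(p) + 1.0) for p in pi))
    return [t + lr * g for t, g in zip(theta, grad)]
```

With a reward that favors one formula, repeated updates shift the policy mass toward it while the entropy term keeps the other choices from collapsing too quickly.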
The weight learning stage aims to learn an appropriate weight for each generated formula, which is achieved by maximizing the log-likelihood:
L(w) = Σ_{i=1}^{N} Σ_{a} log P_w(X_a = x_a | MB_{x_i}(a))  (13)
Here N is the size of a batch of training data, and x_a is a binary variable equal to 1 if action a occurs in the i-th video v_i, and 0 otherwise.
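In implementation terms, this objective reduces to summing log P (or log(1 − P)) over the action labels of a batch; a sketch, where the layout of `probs` and `labels` is our assumption:

```python
import math

def log_likelihood(probs, labels):
    """Batch log-likelihood for weight learning: probs[i][a] is the
    inferred P(x_a = 1 | MB) for video i, labels[i][a] is in {0, 1}."""
    total = 0.0
    for p_row, y_row in zip(probs, labels):
        for p, y in zip(p_row, y_row):
            total += math.log(p if y == 1 else 1.0 - p)
    return total
```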
The whole training procedure alternates between rule exploration and weight learning. First, the initialized rule policy network π generates a formula set on which weight learning is performed; the learned weights are then fixed, and the action recognition accuracy is computed to estimate the gradient in Equation (11) and update the parameters of π. The updated π then generates a new formula set, and weight training is performed again. The two stages alternate several times.
5. Combination with deep models
An untrimmed video usually contains multiple actions, and there may be latent connections among them. Taking a video from Charades as an example, the actions "holding a broom", "putting a broom somewhere", and "tidying something on the floor" are plausibly related: while a person tidies something on the floor, he may hold a broom and put it back once finished. The method proposed in the present invention can therefore serve as an inference layer combined with the output of a deep model, enhancing the recognition of hard-to-detect action categories (e.g., tidying something on the floor) based on the predictions for easy-to-detect actions (e.g., holding a broom). Specifically, the framework is used to learn logical formulas and corresponding weights that represent the connections among these actions. During inference, given the confidence scores output by the deep model, detections with high confidence are treated as observed evidence, and probabilistic inference is performed over the remaining action categories, thereby improving their detection accuracy.
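As a deliberately simplified stand-in for that inference layer (the actual system performs probabilistic inference in the learned Markov logic network), the following sketch treats high-confidence detections as evidence and boosts rule-implied actions; the rule format, threshold, and update rule are all our assumptions:

```python
def refine_with_rules(scores, rules, evidence_thresh=0.8):
    """Treat high-confidence detections as observed evidence and raise
    the scores of rule-implied actions. `rules` maps an action to a
    list of (premise_action, weight) pairs."""
    evidence = {a for a, s in scores.items() if s >= evidence_thresh}
    refined = dict(scores)
    for action, premises in rules.items():
        boost = sum(w for premise, w in premises if premise in evidence)
        if boost > 0:
            # Soft update: move the score toward 1 in proportion to the boost.
            refined[action] = min(1.0, scores[action] + boost * (1.0 - scores[action]))
    return refined
```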
6. Experimental results
To demonstrate that the present technique outperforms prior work, experiments are conducted on two representative datasets, Charades and CAD-120. The former is a large video dataset of roughly 9,800 untrimmed videos, of which 7,985 are used for training and 1,863 for testing. The videos contain 157 complex daily activities across 15 different indoor scenes; on average each video contains 6.8 distinct action categories, often with multiple categories in the same frame, which makes recognition highly challenging. The latter is an RGB-D dataset focused on human activities of daily living, consisting of 551 video clips and 32,327 frames covering 10 different high-level activities (e.g., having a meal, assembling objects). For Charades, mAP (mean average precision) is computed to evaluate the detection performance over all action categories; for CAD-120, the mAR (mean average recall) metric is used to measure whether the model successfully recognizes the performed actions.
Table 1 shows the action recognition results on Charades. The proposed model reaches 38.4% mAP and surpasses strong 3D CNN models such as I3D, 3D R-101, and Non-Local, indicating that the generated formulas and their weights let the model fully exploit the temporal interactions among actions. State-of-the-art 3D models such as X3D achieve higher performance thanks to pre-training on the large video benchmark Kinetics, but the proposed method exceeds deep models pre-trained only on ImageNet (38.4% vs. 21.0%). In addition, because the accuracy of the scene graph predictor is a limiting factor, an Oracle version is also designed, which assumes that the visual relations on all video frames are predicted correctly. As shown at the bottom of Table 1, the Oracle version improves mAP substantially (by about 24%) and outperforms all deep models by a large margin, demonstrating the strong potential of the method. Integration with the deep model SlowFast (R-50) is also evaluated; by exploiting the relations among different actions, the proposed model further improves the deep model's performance.
Table 1: Comparison of action recognition performance of different methods on Charades
For the CAD-120 dataset, each long video sequence is split into short segments so that every segment contains exactly one action, and the average recall of each action is evaluated. As shown in Table 2, the proposed model achieves the best mAR. Although the Explainable AAR-RAR method also adopts an interpretable recognition framework, it relies on transition patterns defined by domain experts and performs action inference by observing specific state transitions between two adjacent frames. In contrast, the present model exploits logical rules learned from real data and is more robust and efficient.
Table 2: Comparison of action recognition performance of different methods on CAD-120
Because the present model recognizes complex actions through interpretable logical formulas, it can provide convincing evidence for why a prediction was made. Based on when that evidence appears, the model can also localize when an action occurs in the video. The results of the proposed method are compared with several advanced deep models on Charades. As Table 3 shows, the model achieves strong action localization results: it outperforms models pre-trained only on ImageNet (20.9% mAP vs. 14.2% mAP) and remains close to models pre-trained on Kinetics. Although slightly weaker than the latter in mAP, the localization results of the present method are more interpretable.
Table 3: Comparison of action localization performance of different methods on Charades
To illustrate the interpretability and diversity of the generated logic rules, Figure 3 shows examples of formulas learned by the model and their weights. Formulas with higher weights generally provide better inference evidence for an action; for example, "holding broom → standing on floor → looking at floor" provides clear inferred evidence for the action "tidying something on the floor". A user study on interpretability was also conducted: the weights of the model-generated formulas were evenly divided into three groups, with the rules in each group labeled good, middle, and bad, respectively; 20 action categories were sampled from Charades, with one formula drawn at random from each group as its representative; and the 21 subjects who participated re-ordered the shuffled formulas by their relevance to the action. The statistics of the user study are shown in Figure 4. The results show a high degree of agreement between the learned formula weights and human common sense (for example, 78.75% of the good rules were also labeled as good by humans).
The above embodiments are intended only to illustrate, not to limit, the technical solution of the present invention. Those skilled in the art may modify or equivalently replace the technical solution without departing from the principle and scope of the present invention, whose scope of protection is defined by the claims.

Claims (10)

  1. A complex action recognition method based on a learnable Markov logic network, the steps comprising:
    using a policy network to automatically learn, from training data, the set of logical rules corresponding to each action;
    splitting a video to be detected into several video segments, and computing a confidence score for each <action participant, visual relation, object> triple in every segment;
    inputting the set of logical rules and the confidence scores of all triples in a video segment into an improved Markov logic network to obtain the occurrence probability of each action in that segment, wherein the improved Markov logic network is obtained by relaxing the operations between Boolean variables in a Markov logic network into functions defined on continuous variables; and
    obtaining the action recognition result of the video to be detected according to the occurrence probabilities.
  2. The method according to claim 1, wherein the set of logical rules is obtained by the following steps:
    1) at time t, computing the embedded feature x_{t-1} of the relational predicate R_{t-1} obtained at the previous time step;
    2) feeding x_{t-1} and the hidden state h_{t-1} into a gated recurrent unit (GRU) network;
    3) computing, from the GRU output, the generation probability of the relational predicate R_t at time t;
    4) sampling a concrete relational predicate R_t according to the generation probability;
    5) obtaining the sampling probability of a formula f from the generation probabilities of the relational predicates R_t at each time step; and
    6) based on the sampling probability, adding one or more formulas f to the action's formula set to obtain the set of logical rules corresponding to the action.
  3. The method according to claim 1, wherein the video to be detected is split by the following strategy:
    1) generating sliding windows of multiple different sizes;
    2) for a sliding window of size L, setting its stride to L/2; and
    3) cutting the video to be detected with the given stride to generate video segments of length L.
  4. The method according to claim 3, wherein the <action participant, visual relation, object> triples are obtained by the following steps:
    1) uniformly sampling M video frames from the video segment;
    2) detecting the objects o_i in the video frames with a Faster R-CNN detector using ResNet-101 as the backbone network; and
    3) detecting the j-th visual relation e_ij between object o_i and every action participant p in each video frame, yielding the <action participant p, visual relation e_ij, object o_i> triples on the sampled frames.
  5. The method according to claim 4, wherein the confidence score is computed by the following steps:
    1) for each generated triple <action participant p, visual relation e_ij, object o_i>, computing the confidence score s_p of the action participant p, the confidence score s_{o_i} of the object o_i, and the confidence score s_{e_ij} of the visual relation e_ij; and
    2) computing the confidence score of the whole triple from s_p, s_{o_i}, and s_{e_ij}.
  6. The method according to claim 1, wherein the occurrence probability of each action in a video segment is obtained by the following steps:
    1) converting each formula f in the rule set into a Horn clause according to the functions defined on continuous variables and the transformation rules of first-order logic;
    2) computing the value of each instance of formula f_i from the Horn clause and the confidence scores;
    3) obtaining from these instance values the number n_i of instances for which formula f_i is true; and
    4) computing, based on n_i, the occurrence probability of each action in the segment.
  7. The method according to claim 1, wherein the action recognition result for the entire video is obtained by applying a max-pooling operation to the occurrence probabilities of each action over all video segments.
  8. The method according to claim 1, wherein the improved Markov logic network and the policy network that generates the rules for each action are trained by the following steps:
    1) generating a set of logical rules with the rule policy network π_l, and obtaining the weights of the improved Markov logic network by maximizing the log-likelihood;
    2) fixing the weights of the improved Markov logic network, and updating the rule policy network parameters with the policy gradient algorithm by maximizing a reward function, yielding the rule policy network π_{l+1}, wherein the reward function is an action recognition evaluation metric; and
    3) when the rule policy network π_l and the improved Markov logic network satisfy a preset condition, obtaining the trained rule policy network and the trained improved Markov logic network.
  9. A storage medium storing a computer program, wherein the computer program is configured to perform, when run, the method according to any one of claims 1-8.
  10. An electronic device comprising a memory and a processor, the memory storing a computer program, the processor being configured to run the computer program to perform the method according to any one of claims 1-8.
PCT/CN2022/135767 2022-01-11 2022-12-01 Complex action recognition method and apparatus based on learnable markov logic network WO2023134320A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210027024.0A CN116469155A (en) 2022-01-11 2022-01-11 Complex action recognition method and device based on learnable Markov logic network
CN202210027024.0 2022-01-11

Publications (1)

Publication Number Publication Date
WO2023134320A1 true WO2023134320A1 (en) 2023-07-20

Family

ID=87175819

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/135767 WO2023134320A1 (en) 2022-01-11 2022-12-01 Complex action recognition method and apparatus based on learnable markov logic network

Country Status (2)

Country Link
CN (1) CN116469155A (en)
WO (1) WO2023134320A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117434989A (en) * 2023-12-20 2024-01-23 福建省力得自动化设备有限公司 System and method for regulating and controlling environment in electrical cabinet

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117612072B (en) * 2024-01-23 2024-04-19 中国科学技术大学 Video understanding method based on dynamic space-time diagram

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103942575A (en) * 2014-04-02 2014-07-23 公安部第三研究所 System and method for analyzing intelligent behaviors based on scenes and Markov logic network
US9230159B1 (en) * 2013-12-09 2016-01-05 Google Inc. Action recognition and detection on videos
CN105894017A (en) * 2016-03-28 2016-08-24 中山大学 On-line activity identification method and system based on Markov logic network
CN110188596A (en) * 2019-01-04 2019-08-30 北京大学 Monitor video pedestrian real-time detection, Attribute Recognition and tracking and system based on deep learning
CN112131944A (en) * 2020-08-20 2020-12-25 深圳大学 Video behavior identification method and system


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117434989A (en) * 2023-12-20 2024-01-23 福建省力得自动化设备有限公司 System and method for regulating and controlling environment in electrical cabinet
CN117434989B (en) * 2023-12-20 2024-03-12 福建省力得自动化设备有限公司 System and method for regulating and controlling environment in electrical cabinet

Also Published As

Publication number Publication date
CN116469155A (en) 2023-07-21

Similar Documents

Publication Publication Date Title
Mukhoti et al. Deterministic neural networks with inductive biases capture epistemic and aleatoric uncertainty
WO2023134320A1 (en) Complex action recognition method and apparatus based on learnable markov logic network
Skarlatidis et al. Probabilistic event calculus for event recognition
Domingos et al. Markov logic: A unifying framework for statistical relational learning
Domingos et al. Markov logic
CN111026875A (en) Knowledge graph complementing method based on entity description and relation path
Tan et al. Learning high-dimensional Markov forest distributions: Analysis of error rates
CN112633478A (en) Construction of graph convolution network learning model based on ontology semantics
US20050216496A1 (en) Using tables to learn trees
Mukhoti et al. Deep deterministic uncertainty: A simple baseline
Dhurandhar et al. A formal framework to characterize interpretability of procedures
US11847152B2 (en) Patent evaluation method and system that aggregate patents based on technical clustering
Liu et al. Tractable regularization of probabilistic circuits
Patel et al. GAN-based priors for quantifying uncertainty in supervised learning
Fan et al. Scalable deep generative relational model with high-order node dependence
Dang et al. Sparse probabilistic circuits via pruning and growing
Ajoodha et al. Tracking influence between naïve bayes models using score-based structure learning
Samel et al. Active deep learning to tune down the noise in labels
Yang et al. Monotonicity induced parameter learning for bayesian networks with limited data
Kolar et al. Sparsistent estimation of time-varying discrete Markov random fields
Maniatopoulos et al. Artificial neural network performance boost using probabilistic recovery with fast cascade training
Jabbari et al. Obtaining accurate probabilistic causal inference by post-processing calibration
Riggelsen Learning Bayesian networks: a MAP criterion for joint selection of model structure and parameter
Chen et al. Relational neural markov random fields
Wu et al. Memory reconstruction based dual encoders for anomaly detection

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22919953

Country of ref document: EP

Kind code of ref document: A1