CN116469155A - Complex action recognition method and device based on learnable Markov logic network - Google Patents
Complex action recognition method and device based on learnable Markov logic network
- Publication number: CN116469155A (application CN202210027024.0A)
- Authority: CN (China)
- Prior art keywords: action, video, network, formula, probability
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06N5/041 — Abduction (inference or reasoning models in knowledge-based computing arrangements)
- G06N3/0464 — Convolutional networks [CNN, ConvNet]
- G06N3/08 — Neural network learning methods
- G06N5/04 — Inference or reasoning models
- G06N7/00 — Computing arrangements based on specific mathematical models
- G06V40/20 — Movements or behaviour, e.g. gesture recognition
- Y02T10/40 — Engine management systems
Abstract
The invention discloses a complex action recognition method and device based on a learnable Markov logic network, comprising the following steps: automatically learning a set of logic rules for each action from training data using a policy network; dividing the video to be analyzed into multiple video segments and calculating confidence scores for the <action participant, visual relationship, object> triples in each segment; inputting the logic rule set and the confidence scores of all triples in a segment into an improved Markov logic network to obtain the occurrence probability of each action in that segment; and deriving the action recognition result for the whole video from these occurrence probabilities. The method is clearly interpretable, does not depend on rules defined by domain experts, is compatible with existing models, and is efficient; it can not only identify the category of an action but also localize the action within the video clip.
Description
Technical Field
The invention belongs to the field of computer vision, and particularly relates to a complex action recognition method and device based on a learnable Markov logic network.
Background
Action recognition is a fundamental task in video understanding and has attracted considerable attention from researchers in recent years. With the rapid development of deep learning, 3D convolutional neural networks (3D CNNs) have thoroughly revolutionized this research area; backed by a variety of well-designed network architectures and learning algorithms, they have become the dominant method for video action recognition. The powerful representation capabilities of 3D CNNs allow them to capture complex semantic dependencies across video frames better than early work based on low-level features such as trajectories and keypoints.
Although these deep neural networks are widely used in video action recognition, they still suffer from some inherent drawbacks. In general, the workflow of a 3D CNN is as follows: a video clip is input and, after passing through the multi-layer network, a score is output representing the confidence of each action category. Such a black-box prediction mechanism does not explicitly provide the basis for identifying an action, such as when, where, and why the action occurs in the video. Moreover, lacking interpretability, these networks are vulnerable to attack, which greatly limits their application in real-world scenarios with strict security requirements. In recent years, more and more research effort has been devoted to exploring the interpretability of deep learning. It is therefore important to develop a highly interpretable action inference framework.
The invention builds on conclusions from cognitive science and neuroscience: people typically represent a complex event as a combination of atomic units. Related studies have also recently shown that a complex action can be decomposed into a series of spatiotemporal scene graphs depicting how a person interacts with surrounding objects over time. Take the action "person wakes up in bed" shown in FIG. 1 as an example. The person may initially lie in the bed, then wake up and sit up in the bed. This process can be represented by a change over time in the visual relationship between the person and the bed, i.e., from "person-lying on-bed" to "person-sitting on-bed". This characteristic allows a model to explicitly identify the occurrence of actions by detecting transition patterns of visual relationships in the video, significantly improving its interpretability and robustness. To achieve this, two key challenges must be addressed: (1) how to learn these visual relationship transition patterns automatically from data, rather than expending great effort to specify the rules manually; (2) how to avoid the negative effects of the noise that model-generated rules often contain, so that efficient action reasoning can still be performed.
Disclosure of Invention
To make up for the lack of interpretability in depth models and to address the two challenges above, the invention discloses a method and device for recognizing complex actions based on a learnable Markov logic network, designing a novel interpretable action reasoning framework to recognize complex actions in video. To this end, the invention uses first-order logic to model the temporal variation of complex actions over semantic states. Specifically, in each logic rule the visual relationships act as the atomic predicates. The logic rules carry rich information and can be generated automatically by a rule policy network that incrementally appends relation predicates related to an action. Since the rules are generated automatically rather than carefully defined by domain experts, they are prone to error. To solve this problem, the invention utilizes a Markov logic network (MLN), a statistical relational model combining first-order logic with probabilistic graphical models. The model associates each logic rule with a real-valued weight that measures the rule's uncertainty: the greater the weight, the more reliable the corresponding rule. In this way, formulas carrying noisy information can be assigned lower (or even negative) weights, reducing their adverse effects. Using the generated formulas and the Markov logic network, the invention performs probabilistic logic reasoning and finally determines the occurrence probability of each action.
The technical content of the invention comprises:
a complex action recognition method based on a learnable Markov logic network comprises the following steps:
automatically learning a logic rule set corresponding to each action from the training data by using a strategy network;
dividing the video to be detected into a plurality of video segments, and calculating confidence scores for the <action participant, visual relationship, object> triples in each video segment;
inputting the logic rule set and the confidence scores of all triples in a video segment into an improved Markov logic network to obtain the occurrence probability of each action in the video segment, wherein the improved Markov logic network is obtained by relaxing the logical operations between Boolean variables in the Markov logic network into functions defined on continuous variables;
and acquiring an action recognition result of the video to be detected according to the occurrence probability.
Further, the set of logic rules is obtained by:
1) At time t, calculating the embedded feature x_{t-1} of the relation predicate R_{t-1} obtained at the previous time step;
2) Inputting x_{t-1} and the hidden state h_{t-1} into a gated recurrent neural network (GRU);
3) Calculating, from the GRU output, the generation probability of the relation predicate R_t at time t;
4) Sampling a specific relation predicate R_t from that generation probability;
5) Obtaining the sampling probability of a formula f from the relation predicates R_t over time;
6) Based on the sampling probabilities, placing one or more formulas f into the formula set of the action to obtain the logic rule set corresponding to the action.
Further, the video to be detected is sliced by the following strategy:
1) Generating a sliding window having a plurality of different sizes;
2) For a sliding window with the size L, setting the sliding step length of the sliding window to be L/2;
3) Slicing the video to be detected according to the sliding step to generate video segments of length L.
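The slicing strategy above can be sketched in a few lines; the function name and the frame-index representation of a segment are illustrative assumptions, not taken from the patent:

```python
def sliding_window_segments(num_frames, window_sizes):
    """Generate (start, end) frame spans using several window sizes, each
    with stride L/2 so adjacent segments overlap by half a window."""
    segments = []
    for L in window_sizes:
        step = max(1, L // 2)
        start = 0
        while start + L <= num_frames:
            segments.append((start, start + L))
            start += step
    return segments
```

For an 8-frame video and a single window of size 4, this yields spans (0, 4), (2, 6), (4, 8), each overlapping its neighbor by 2 frames.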
Further, the <action participant, visual relationship, object> triples are obtained by:
1) Uniformly sampling M video frames from the video segment;
2) Detecting the objects o_i in the sampled frames using a Faster R-CNN detector with ResNet-101 as the backbone network;
3) Detecting, in each frame, the j-th visual relationship e_{ij} between each object o_i and every action participant p, yielding the <action participant p, visual relationship e_{ij}, object o_i> triples for the sampled frames.
Further, a confidence score is calculated by:
1) For each generated triple <action participant p, visual relationship e_{ij}, object o_i>, calculating the confidence score s_p of the action participant p, the confidence score s_{o_i} of the object o_i, and the confidence score s_{e_{ij}} of the visual relationship e_{ij};
2) Calculating the confidence score of the whole triple from s_p, s_{o_i}, and s_{e_{ij}}.
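Assuming, as is standard for scene-graph triples, that the three component scores are combined by multiplication (the patent's exact combination formula is given later in the description), the triple score is a one-liner:

```python
def triple_confidence(s_p, s_o, s_e):
    """Confidence of a <participant, relationship, object> triple as the
    product of its component detection scores (an assumed, standard form)."""
    return s_p * s_o * s_e
```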
Further, the occurrence probability of each action in a video segment is obtained by:
1) Converting each formula f in the rule set into a Horn clause according to the transformation rules of first-order logic, with the Boolean operations relaxed into functions defined on continuous variables;
2) Calculating the value of each instance of formula f_i from the Horn clause and the confidence scores;
3) Obtaining from these instance values the number n_i of instances of formula f_i that are true;
4) Calculating the occurrence probability of each action in the video segment from the numbers n_i.
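The continuous relaxation used in these steps is, per the description, Lukasiewicz logic. A minimal illustration of the relaxed operators and the value of a Horn clause ¬R_1 ∨ ... ∨ ¬R_T ∨ A (a sketch, not the patent's implementation):

```python
def luk_and(x, y):
    """Lukasiewicz conjunction on [0, 1]-valued truth scores."""
    return max(0.0, x + y - 1.0)

def luk_or(x, y):
    """Lukasiewicz disjunction on [0, 1]-valued truth scores."""
    return min(1.0, x + y)

def luk_not(x):
    """Lukasiewicz negation."""
    return 1.0 - x

def horn_clause_value(premise_scores, conclusion):
    """Value of the Horn clause ¬R_1 ∨ ... ∨ ¬R_T ∨ A, where premise_scores
    are the [0, 1] confidence scores of the relation predicate instances and
    conclusion is the 0/1 value of the action predicate A."""
    value = float(conclusion)
    for s in premise_scores:
        value = luk_or(value, luk_not(s))
    return value
```

Iterating the disjunction this way is equivalent to min(1, Σ_t (1 − s_t) + x_a), since once the clamp at 1 is hit the value stays 1.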
Further, the action recognition result for the whole video is obtained by performing a max-pooling operation over the occurrence probabilities of each action across all video segments.
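The max-pooling step above can be sketched as follows; representing per-segment results as dicts mapping action names to probabilities is an illustrative assumption:

```python
def video_level_scores(segment_probs):
    """Max-pool per-action occurrence probabilities over all segments of a
    video. segment_probs: list of {action: probability} dicts, one per segment."""
    scores = {}
    for probs in segment_probs:
        for action, p in probs.items():
            scores[action] = max(scores.get(action, 0.0), p)
    return scores
```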
Further, the improved Markov logic network and the policy network that generates rules for each action are trained by:
1) Generating a set of logic rules F_l with the rule policy network π_l, and obtaining the weights of the improved Markov logic network M_l by maximizing the log-likelihood;
2) Fixing the improved Markov logic network M_l, and updating the rule policy network parameters with a policy gradient algorithm by maximizing a reward function, yielding the rule policy network π_{l+1}, wherein the reward function is an action recognition evaluation metric;
3) When the rule policy network π_l and the improved Markov logic network M_l satisfy the set condition, obtaining the trained rule policy network and improved Markov logic network.
A storage medium having a computer program stored therein, wherein the computer program is arranged to perform the method described above when run.
An electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the method described above.
Compared with the prior art, the invention has the following advantages:
(1) Superior interpretability. Compared with the currently popular deep 3D convolutional neural networks, the proposed action inference framework is markedly interpretable, because the weighted logic rules serve as an explicit basis for identifying specific actions. In addition, by explicitly modeling the temporal evolution pattern of an action, the framework can not only identify the category of the action but also localize it within the video clip.
(2) No dependence on domain-expert definitions. The proposed rule policy network automatically learns the logic rules encoding complex actions from data, without manual definition, making the whole framework more robust. Existing works that use Markov logic networks for reasoning often rely on domain experts to carefully design the rules encoding events, which greatly limits their applicability. Mining rules automatically from data also makes the proposed inference framework applicable in big-data scenarios.
(3) Compatibility and efficiency. The method combines well with existing depth-model-based methods and can further improve action recognition performance. Moreover, the model can discover the relationship transition patterns corresponding to actions without excessive training data and still obtain good prediction results.
Drawings
FIG. 1 is an example diagram of decomposing an action into a spatiotemporal scene graph.
Fig. 2 is a calculation flow of the whole method.
FIG. 3 is a visualization of rules and corresponding weights learned by a model.
FIG. 4 is a graph of the results of a user survey of the model of the invention.
Detailed Description
The present invention will be described in further detail below with reference to examples and drawings in order to more specifically explain the technical details and advantages of the present invention.
As previously mentioned, complex actions can generally be decomposed into human-object interactions that vary over time. Based on this conclusion, the invention designs an interpretable action inference framework by modeling the evolution patterns of these visual relationships. As shown in FIG. 2, the proposed method consists of two main parts. The first is a rule policy network whose purpose is to mine an optimal set of formulas for each action, each formula explicitly representing a particular relationship transition pattern. The second is an action inference module that uses a Markov logic network to perform probabilistic logic inference over the formula set generated by the policy network, computing the probability of each action occurring. The implementation details of each module and the training algorithm of the whole framework are explained below.
1. Markov logic network
A Markov logic network (MLN) is a probabilistic graphical model that incorporates logic: it uses first-order logic formulas to define the potential functions of a conventional Markov random field. In a Markov logic network, each logic formula has an associated real-valued weight representing the formula's importance and reliability: a formula with a higher weight is more important, and the knowledge it encodes is more reliable. Essentially, the Markov logic network relaxes the hard constraints of pure first-order logic, so that formulas of low reliability, or even erroneous ones, can still be accommodated: worlds violating them are not impossible, merely less probable.
Specifically, let F = {f_i} denote a set of logical formulas, let ω_i be the weight corresponding to formula f_i, and let C be a finite set of constants. The Markov logic network M = (F, ω) is then defined as follows: every possible grounding of each atomic predicate in the formulas f_i with constants from C is a binary node of M; the node takes the value 1 if the grounded predicate is true, and 0 otherwise. Every possible grounding of each formula f_i acts as a potential function whose value is 1 if the grounded formula is true and 0 otherwise. Accordingly, there is an edge between two nodes of M if and only if their corresponding predicates appear together in at least one formula. The formula set F can thus be seen as a template for constructing the Markov logic network. By this definition, the probability corresponding to a state x can be expressed as

P(X = x) = (1/Z) exp(Σ_{i=1}^{F} ω_i n_i(x))    (1)

where n_i(x) is the number of groundings of formula f_i that are true under the assignment x, F is the number of formulas in the set, and Z is a normalization constant with value Z = Σ_{x'} exp(Σ_i ω_i n_i(x')).
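Equation (1) can be illustrated with a tiny enumeration over states; representing states as dict keys mapped to their true-grounding counts [n_1(x), ..., n_F(x)] is an illustrative assumption:

```python
import math

def mln_state_probability(weights, counts_per_state, state):
    """Probability of one state under equation (1): P(x) ∝ exp(Σ_i ω_i n_i(x)).
    counts_per_state maps each possible state to its list of true-grounding
    counts n_i(x); the normalizer Z sums the exponentiated scores over all states."""
    def score(s):
        return math.exp(sum(w * n for w, n in zip(weights, counts_per_state[s])))
    z = sum(score(s) for s in counts_per_state)  # normalization constant Z
    return score(state) / z
```

With a single formula of weight ln 3 that is true once in state "a" and never in state "b", state "a" receives probability 3/(3+1) = 0.75.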
2. Logic rule generation
Unlike methods that use manually defined logic formulas, the invention aims to generate the corresponding logic formulas for each action automatically, without relying on any human effort. Specifically, the invention models human-object interaction patterns in the form R_1 ∧ ... ∧ R_t ∧ ... ∧ R_T, where R_{1:T} are relation predicates on different frames and T is the total number of these predicates. The formula f encoding a complex action a can then be expressed as:

R_1 ∧ ... ∧ R_T ⇒ A    (2)

where A is the predicate representing action a. Given a particular action predicate A, only the left-hand side of f needs to be determined. Since the left-hand side contains only conjunctions (∧), it can be further represented as a linear sequence l_f = (R_1, ..., R_T). With this definition, generating the formula f becomes a sequential decision process: predicting an optimal sequence l_f for each action. The invention models this process with a policy network π, which approximates the probability distribution π(f|a; θ) that all possible formulas f for action a should satisfy, where θ are the parameters of the distribution. Once θ is determined, samples can be drawn from π(f|a; θ) accordingly to construct the required formula set F_a. The invention uses a gated recurrent neural network (GRU) to express this probability distribution. Specifically, the network can be expressed as:
h_t = GRU(x_t, h_{t-1})    (3)
where x_t is the embedded feature of the relation predicate R_t at step t, and h_{t-1} is the hidden state of the policy network π, which aggregates the information of all past relation predicates {R_1, ..., R_{t-1}}. At the initial step, the feature vector x_0 of the action predicate A is input to π; the generation probability of each predicate R_t is then calculated by the following equation:
p(R_t | R_1, ..., R_{t-1}, A) = softmax(W_p h_t)    (4)
where W_p is a parameter learned from the data. During training, a sequence can be sampled according to this probability to obtain a specific formula f. Thus, the probability that each formula f is sampled is:

p(f | a; θ) = Π_{t=1}^{T} p(R_t | R_1, ..., R_{t-1}, A)    (5)
After training the policy network π, the invention uses a beam search strategy to sample the k best formulas from the distribution π(f|a; θ) for each action a as the generated formula set F_a.
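The sequential generation process of equations (3) and (4), with the formula probability as the product of per-step probabilities, can be sketched as follows. Here an arbitrary callable stands in for the trained GRU policy; all names and the dict-based predicate distribution are assumptions:

```python
import math
import random

def sample_formula(step_probs, action, max_len, rng=random.random):
    """Sample one formula body R_1, ..., R_T from a step-wise policy.
    step_probs(history) stands in for the GRU: it returns a dict mapping
    each candidate relation predicate to p(R_t | R_1..R_{t-1}, A). The
    returned log-probability is the sum of per-step log-probabilities,
    i.e. the log of the product form of the sampling probability."""
    history = [action]  # the action predicate A seeds the sequence
    log_prob = 0.0
    for _ in range(max_len):
        probs = step_probs(history)
        u, acc, choice = rng(), 0.0, None
        for predicate, p in probs.items():  # inverse-CDF sampling
            acc += p
            if u <= acc:
                choice = predicate
                break
        if choice is None:  # guard against floating-point shortfall
            choice = predicate
        history.append(choice)
        log_prob += math.log(probs[choice])
    return history[1:], log_prob
```

With a deterministic rng the sampler is reproducible, which is convenient for testing the decision process independently of any trained network.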
3. Action reasoning
This section presents the detailed probabilistic reasoning process for actions. The inference module consists of three steps (see FIG. 2), described in turn below.
(Step 1) Sliding-window video segment generation. Given an untrimmed long video v, the invention first processes v with a sliding-window mechanism to generate multiple video segments. Since different actions vary widely in duration, the sliding window is set to several different sizes to generate segments of different lengths. In addition, for a sliding window of size L, the sliding step is set to L/2, so that each segment overlaps its adjacent segments by L/2 frames. The entire set of video segments generated by the sliding windows, denoted U, serves as the candidate proposals for actions possibly present in video v.
(Step 2) Scene graph prediction. For each video segment u ∈ U generated in the previous step, the invention uses a pre-trained scene graph predictor to extract high-level visual information from the video frames. Specifically, the predictor first detects all objects in each frame using a Faster R-CNN detector with ResNet-101 as the backbone network, and then predicts all possible visual relationships between these objects and the person. The generated scene graph can be represented as G = (O, E). Here O = {o_1, o_2, ...} is the set of objects the action participant p interacts with, and E = {{e_11, e_12, ...}, {e_21, e_22, ...}, ...} represents the visual relationships between person and objects, where e_{ij} denotes the j-th visual relationship between the action participant p and the i-th object o_i. Owing to the diversity of visual interactions, there may be several different types of visual relationships between each participant and an object. Notably, each triple r_{ij} = <p, e_{ij}, o_i> can be regarded as a specific instance of its corresponding relation predicate on the video segment. The confidence score s_{r_{ij}} of the instance r_{ij} is given by:

s_{r_{ij}} = s_p · s_{o_i} · s_{e_{ij}}    (6)

where s_p, s_{o_i}, and s_{e_{ij}} are the predicted confidence scores of the action participant p, the object o_i, and the relationship e_{ij} between them, as given by the scene graph predictor. Since the visual relationship between a person and an object hardly changes across several consecutive frames, generating a scene graph for every frame of segment u would be computationally redundant; therefore only M frames are uniformly sampled from each segment u ∈ U for the above prediction.
(Step 3) Action probability inference. Given the trained Markov logic network M, the probability of each action a in a video segment can be inferred. By equation (1), computing the overall probability requires determining the number n_i(x) of instances of formula f_i that are true on segment u. In the original Markov logic network, the value of a logic formula is obtained by logical operations on binary predicates, which can only take the discrete values 0 or 1. However, the relation predicate instances here take real values s_{r_{ij}} in the range [0, 1], which makes it ill-defined whether a formula instance should take the value 1 or 0. To remain compatible with the logical operations of first-order logic, the invention uses Lukasiewicz logic to relax the operations between Boolean variables into functions defined on continuous variables. The relaxed conjunction (∧), disjunction (∨), and negation (¬) are defined as: x ∧ y = max(0, x + y − 1), x ∨ y = min(1, x + y), ¬x = 1 − x. With this relaxation, n_i(x) can be computed effectively. Taking the formula of equation (2) as an example, such a formula can first be converted into a Horn clause according to the transformation rules of first-order logic:

¬R_1 ∨ ¬R_2 ∨ ... ∨ ¬R_T ∨ A    (7)

which can be seen as a disjunction over positive and negative predicates.
Then, based on the predicted scene graph on u, the value of each formula instance f_i(x) is:

f_i(x) = min(1, (1 − s_{R_1}) + (1 − s_{R_2}) + ... + (1 − s_{R_T}) + x_a)    (8)

where s_{R_t} is the confidence score of the corresponding relation predicate instance obtained by equation (6), and x_a is a binary variable with value 0 or 1 indicating whether action a has occurred. Thus, n_i(x) is obtained by summing the values f_i(x) of all instances of the formula. The probability of action a occurring on video segment u is then given by:

P(x_a = 1 | MB_x(a)) = exp(Σ_{i=1}^{F_a} ω_i n_i(x_a = 1)) / (exp(Σ_{i=1}^{F_a} ω_i n_i(x_a = 0)) + exp(Σ_{i=1}^{F_a} ω_i n_i(x_a = 1)))    (9)

where F_a is the number of formulas associated with action a, and MB_x(a), the Markov blanket of a, consists of the triple instances of all formulas in which a appears.
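The conditional probability of an action given its Markov blanket compares the weighted true-instance counts under x_a = 1 against x_a = 0, a standard MLN form. A minimal sketch (function and argument names are assumptions):

```python
import math

def action_probability(weights, n_true, n_false):
    """Conditional probability that action a occurs, given its Markov blanket:
    weights are the formula weights ω_i; n_true and n_false are the lists of
    true-instance counts n_i with x_a set to 1 and 0 respectively."""
    score1 = math.exp(sum(w * n for w, n in zip(weights, n_true)))
    score0 = math.exp(sum(w * n for w, n in zip(weights, n_false)))
    return score1 / (score0 + score1)
```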
The final prediction result of the whole video v is obtained by performing a maximum pooling operation on the set of segments U.
4. Joint training algorithm
The object of the invention is to learn the most suitable Markov logic network M from the training data. To this end, the training scheme comprises two main phases: rule exploration and weight learning. Because the rule exploration phase is discrete, the policy network π cannot be optimized directly by back-propagating a final loss function. The invention therefore proposes a joint training strategy: the rule exploration phase is optimized with a policy gradient algorithm from reinforcement learning, and the weights of the generated rules are optimized by supervised learning.
Suppose a formula f is sampled from π(f|a; θ); the rule policy network can then be trained by maximizing the expectation of the reward function:
J(θ) = E_{f~π(f|a;θ)}[H(f)]    (10)

Here, H(f) is a metric for evaluating action recognition performance, such as mAP. The gradient ∇_θ J(θ) = E_{f~π(f|a;θ)}[H(f) ∇_θ log π(f|a; θ)] can be estimated by Monte Carlo sampling:

∇_θ J(θ) ≈ (1/K) Σ_{k=1}^{K} H(f_k) ∇_θ log π(f_k|a; θ)    (11)
where K is the number of samples. In addition, the invention introduces a baseline b, the exponential moving average of the most recent values of H(f_k), and the original reward in equation (11) is replaced by H(f_k) − b. Furthermore, to encourage diversity in rule exploration, the entropy of π(f|a; θ) is added to the final loss function as a regularizer.
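One REINFORCE-style update matching the Monte Carlo estimate with baseline described above can be sketched as follows; the gradient callable and all names are illustrative stand-ins, not the patent's implementation:

```python
def reinforce_update(theta, samples, rewards, grad_log_prob, lr, baseline):
    """One policy-gradient ascent step: average (H(f_k) - b) * ∇ log π(f_k)
    over K sampled formulas and move theta in that direction.
    grad_log_prob(theta, f) stands in for the score-function gradient of
    one sampled formula; rewards are the per-sample H(f_k) values."""
    K = len(samples)
    grad = [0.0] * len(theta)
    for f, r in zip(samples, rewards):
        g = grad_log_prob(theta, f)
        for j in range(len(theta)):
            grad[j] += (r - baseline) * g[j] / K
    return [t + lr * gj for t, gj in zip(theta, grad)]
```

Subtracting the baseline b leaves the gradient estimate unbiased while reducing its variance, which is why the moving average of recent rewards is a common choice.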
The weight learning phase aims to learn appropriate weights for the generated formulas, which can be achieved by maximizing the log-likelihood:

L(ω) = Σ_{i=1}^{N} log P_ω(x_a^{(i)})    (12)

where N is the size of a batch of training data and x_a^{(i)} is a binary variable whose value is 1 if action a is present in the i-th video v_i and 0 otherwise.
The entire training process alternates between rule exploration and weight learning. First, an initialized rule policy network π generates a formula set F, and weight learning is performed; with the learned weights fixed, the action recognition accuracy is computed to estimate the gradient in equation (11) and update the parameters of π. The updated π then generates a new formula set F, and weight training is performed again. These two phases alternate several times.
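The alternating schedule can be summarized as a small skeleton; the three callables are illustrative stand-ins for the rule generation, weight learning, and policy-update components described in the text:

```python
def joint_training(generate_rules, learn_weights, update_policy, rounds):
    """Skeleton of the alternating training loop: rule exploration and
    weight learning take turns for a fixed number of rounds."""
    policy, weights = "pi_0", None
    for _ in range(rounds):
        rules = generate_rules(policy)                  # sample formula set F from π
        weights = learn_weights(rules)                  # maximize log-likelihood (12)
        policy = update_policy(policy, rules, weights)  # policy gradient step (11)
    return policy, weights
```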
5. Combination with depth model
An untrimmed video typically contains multiple actions, with potential links between them. Taking a video in Charades as an example, there may be a reasonable link between the actions "holding a broom", "putting the broom somewhere" and "tidying something on the floor": when a person tidies things on the floor, he may hold the broom and then put it back after finishing. The method provided by the invention can therefore serve as an inference layer on top of the output of a depth model, so that recognition of action categories that are hard to detect (such as tidying something on the floor) is reinforced by the predictions for actions that are easy to detect (such as holding a broom). Specifically, the framework of the present invention learns logical formulas and corresponding weights to represent the links between these actions. During inference, given the confidence scores output by the depth model, detections with high confidence are regarded as observed evidence, and probabilistic inference is performed over the remaining action categories, thereby improving detection accuracy.
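A minimal sketch of such an inference layer is shown below. The additive-logit update and the threshold rule are simplifications assumed for illustration, not the full weighted-formula inference of the Markov logic network: detections above a confidence threshold are frozen as evidence, and each weighted rule whose body is fully observed raises the logit of its head action by the rule weight.

```python
import math

def logit(p):
    p = min(max(p, 1e-6), 1.0 - 1e-6)  # clamp to avoid log(0)
    return math.log(p / (1.0 - p))

def refine_scores(scores, rules, tau=0.8):
    """scores -- depth-model confidence per action name
    rules    -- (body_actions, head_action, weight) triples
    tau      -- threshold above which a detection counts as evidence"""
    evidence = {a for a, s in scores.items() if s >= tau}
    refined = dict(scores)
    for body, head, w in rules:
        if head not in evidence and all(b in evidence for b in body):
            # push the head action's logit up by the rule weight
            refined[head] = 1.0 / (1.0 + math.exp(-(logit(scores[head]) + w)))
    return refined

# Hypothetical Charades-style example: easy detections boost a hard one.
scores = {"hold_broom": 0.9, "put_broom": 0.85, "tidy_floor": 0.3}
rules = [(("hold_broom", "put_broom"), "tidy_floor", 1.5)]
refined = refine_scores(scores, rules)  # tidy_floor rises above 0.3
```

Only low-confidence categories are revised, so confident depth-model outputs pass through unchanged.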
6. Experimental results
To demonstrate that the technology of the present invention performs better than the prior art, experiments were conducted on two representative datasets, Charades and CAD-120. The former is a large video dataset consisting of approximately 9,800 untrimmed videos, of which 7,985 are used for training and 1,863 for testing. These videos contain 157 complex daily activities across 15 different indoor scenes. On average, each video contains 6.8 different action categories, and multiple action categories typically appear in the same frame, which makes recognition very challenging. The latter is an RGB-D dataset focused on human activities of daily living; it consists of 551 video clips and 32,327 frames, covering 10 different high-level activities (e.g., having a meal, assembling objects). For Charades, the present invention computes mAP (Mean Average Precision) to evaluate detection performance over all action categories. For CAD-120, mAR (Mean Average Recall) is used to measure whether the model successfully identifies the performed actions.
Table 1 shows the action recognition results on Charades. The model of the invention reaches 38.4% mAP and exceeds powerful 3D CNN models such as I3D, 3D R-101 and Non-Local. This shows that the model of the present invention can fully exploit the interaction information between actions along the time dimension through the generated formulas and their weights. While the most advanced 3D models (e.g., X3D) achieve higher performance thanks to pre-training on large video benchmarks, the method of the invention outperforms depth models pre-trained only on ImageNet (38.4% vs. 21.0%). In addition, because of the accuracy limitations of the scene graph predictor, the invention also designs an Oracle version, which assumes that the visual relationships in all video frames are correctly predicted. As shown at the bottom of Table 1, the Oracle version achieves a significant mAP improvement (about 24%) and clearly exceeds all depth models, demonstrating the strong potential of the method. The invention also evaluates integration with the depth model SlowFast (R-50): the model of the present invention can further improve the performance of the depth model by exploiting the relationships between different actions.
Table 1: action recognition performance comparison of different methods under Charades
For the CAD-120 dataset, the present invention divides each long video sequence into short segments such that each segment contains only one action, and evaluates the average recall for each action. Although the Explainable AAR-RAR method also adopts an interpretable recognition framework, it relies on transition patterns defined by domain experts and performs action reasoning by observing specific state transitions between two adjacent frames. Compared with this method, the present invention uses logic rules learned from real data and is therefore more robust and efficient.
Table 2: comparison of motion recognition performance under CAD-120 by different methods
By identifying complex actions with interpretable logic formulas, the model of the present invention can provide convincing evidence explaining why a prediction was made. Based on the times at which this evidence appears, the present invention can also localize when an action occurs in the video. The present invention compares its results with several advanced depth models on Charades. As shown in Table 3, the model of the present invention achieves strong action localization results. It performs better than models pre-trained only on ImageNet (20.9% mAP vs. 14.2% mAP), and still obtains localization results comparable to models pre-trained on Kinetics. Although slightly weaker in mAP than those models, the action localization results of the present invention are more interpretable.
Table 3: action positioning performance comparison of different methods under Charades
To illustrate the interpretability and diversity of the generated logic rules, FIG. 3 shows formulas learned by the model and their associated weights. It can be observed from FIG. 3 that formulas with higher weights generally provide better inference evidence for actions. For example, "holding a broom → standing on the floor → looking at the floor" provides clear evidence for detecting the action "tidying something on the floor". In addition, the invention conducted a user study on interpretability. In this study, the weights of the model-generated formulas were evenly divided into three categories by magnitude, with the rules of each category labeled good, middle, and bad, respectively. Next, 20 action categories were sampled from Charades, and 1 formula was randomly extracted from each category as its representative. The 21 subjects participating in the study were asked to reorder the shuffled formulas according to their relevance to the action. The statistics of the user study are shown in FIG. 4. The results show that the learned formula weights are highly consistent with common human knowledge (e.g., 78.75% of good rules were still manually labeled as good).
The above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and those skilled in the art may modify or substitute the technical solution of the present invention without departing from the principle and scope of the present invention, and the protection scope of the present invention shall be defined by the claims.
Claims (10)
1. A complex action recognition method based on a learnable Markov logic network comprises the following steps:
automatically learning a logic rule set corresponding to each action from the training data by using a strategy network;
dividing the video to be detected into a plurality of video segments, and calculating confidence scores for the <action participant, visual relationship, object> triples in each video segment;
inputting the logic rule set and the confidence scores of all triples in a video segment into an improved Markov logic network to obtain the occurrence probability of each action in the video segment, wherein the improved Markov logic network is obtained by relaxing the logical operations between Boolean variables in the Markov logic network into functions defined on continuous variables;
and acquiring an action recognition result of the video to be detected according to the occurrence probability.
2. The method of claim 1, wherein the set of logical rules is obtained by:
1) At time step t, calculating the embedded feature x_{t-1} of the relation predicate R_{t-1} obtained at the previous time step t-1;
2) Inputting x_{t-1} and the hidden state h_{t-1} into a gated recurrent neural network (GRU);
3) Based on the output of the GRU, calculating the generation probability of the relation predicate R_t at time step t;
4) Sampling a specific relation predicate R_t according to the generation probability;
5) Obtaining the sampling probability of a formula f from the relation predicates R_t at each time step;
6) Based on the sampled probabilities, putting one or more formulas f into the formula set of the action to obtain the logic rule set corresponding to the action.
3. The method of claim 1, wherein the video to be detected is sliced by the following strategy:
1) Generating a sliding window having a plurality of different sizes;
2) For a sliding window with the size L, setting the sliding step length of the sliding window to be L/2;
3) Cutting the video to be detected according to the sliding step length to generate video segments of length L.
4. A method as claimed in claim 3, wherein the < action participant, visual relationship, object > triplet is obtained by:
1) For the video segment, uniformly sampling M video frames;
2) Detecting the objects o_i in the video frames using a Faster R-CNN detector with ResNet-101 as the backbone network;
3) Detecting, in each video frame, the j-th visual relationship e_{ij} between the object o_i and each action participant p, obtaining the <action participant p, visual relationship e_{ij}, object o_i> triples of the sampled frames.
5. The method of claim 4, wherein the confidence score is calculated by:
1) For each generated triple <action participant p, visual relationship e_{ij}, object o_i>, calculating a confidence score s_p for the action participant p, a confidence score s_{o_i} for the object o_i, and a confidence score s_{e_ij} for the visual relationship e_{ij};
2) Calculating the confidence score of the entire triple from s_p, s_{o_i} and s_{e_ij}.
6. The method of claim 1, wherein the probability of occurrence of each action in the video segment is obtained by:
1) Converting each formula f in the rule set into a Horn clause according to the functions defined on continuous variables and the transformation criteria of first-order logic;
2) Calculating the values of the instances of each formula f_i based on the Horn clauses and the confidence scores;
3) Obtaining the number n_i of true groundings of formula f_i from the values of its instances;
4) Calculating the occurrence probability of each action in the video segment based on the numbers n_i.
7. The method of claim 1, wherein the action recognition result of the whole video is obtained by performing a max pooling operation on the occurrence probability of each action in each video segment.
8. The method of claim 1, wherein the improved Markov logic network and the policy network that generates the rules for each action are trained by:
1) Generating a set of logic rules based on the rule policy network π_l, and obtaining the weights of the improved Markov logic network by maximizing the log likelihood;
2) Fixing the improved Markov logic network, and updating the parameters of the rule policy network by maximizing the reward function with a policy gradient algorithm to obtain the rule policy network π_{l+1}, wherein the reward function is an action recognition evaluation index;
3) When the rule policy network π_l and the improved Markov logic network meet the set conditions, obtaining the trained rule policy network and the trained improved Markov logic network.
9. A storage medium having a computer program stored therein, wherein the computer program is arranged to perform the method of any of claims 1-8 when run.
10. An electronic device comprising a memory, in which a computer program is stored, and a processor arranged to run the computer program to perform the method of any of claims 1-8.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210027024.0A CN116469155A (en) | 2022-01-11 | 2022-01-11 | Complex action recognition method and device based on learnable Markov logic network |
PCT/CN2022/135767 WO2023134320A1 (en) | 2022-01-11 | 2022-12-01 | Complex action recognition method and apparatus based on learnable markov logic network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116469155A true CN116469155A (en) | 2023-07-21 |
Family
ID=87175819
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210027024.0A Pending CN116469155A (en) | 2022-01-11 | 2022-01-11 | Complex action recognition method and device based on learnable Markov logic network |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN116469155A (en) |
WO (1) | WO2023134320A1 (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117434989B (en) * | 2023-12-20 | 2024-03-12 | 福建省力得自动化设备有限公司 | System and method for regulating and controlling environment in electrical cabinet |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117612072A (en) * | 2024-01-23 | 2024-02-27 | 中国科学技术大学 | Video understanding method based on dynamic space-time diagram |
CN117612072B (en) * | 2024-01-23 | 2024-04-19 | 中国科学技术大学 | Video understanding method based on dynamic space-time diagram |
Also Published As
Publication number | Publication date |
---|---|
WO2023134320A1 (en) | 2023-07-20 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |