CN104698839B

CN104698839B - A multi-agent fault detection and compensation control method based on information interaction

Info

Publication number: CN104698839B
Application number: CN201410832047.4A
Authority: CN
Inventors: 方浩; 陈杰; 李俨
Original assignee: Beijing Institute of Technology BIT
Current assignee: Beijing Institute of Technology BIT
Priority date: 2014-12-26
Filing date: 2014-12-26
Publication date: 2016-04-27
Anticipated expiration: 2034-12-26
Also published as: CN104698839A

Abstract

The present invention addresses the problem that current distributed multi-agent systems are prone to faults and lack a simple, feasible real-time fault handling scheme, and proposes a distributed real-time fault detection and compensation control method based on information interaction. Step 1, system and fault modeling: the modeling comprises a node dynamics model, an information interaction model, and typical fault models. Step 2, real-time multi-agent fault detection based on information interaction. Step 3, information integration and processing based on the Gossip algorithm. Step 4, control-quantity-oriented calculation and application of the compensation. Step 5, connectivity maintenance based on two-hop information: starting from the information interaction model, the communication content between faulty nodes is analyzed, and the two-hop information it contains is used to establish virtual information transmission paths, ensuring that the fault handling scheme does not disturb the normal operation of the system.

Description

A multi-agent fault detection and compensation control method based on information interaction

Technical field

The invention relates to a multi-agent fault detection and compensation control method based on information interaction, and belongs to the technical field of multi-agent control.

Background art

In recent years, with the rapid development of computer and network technology, the scale of multi-agent systems has grown rapidly. Traditional centralized control schemes, limited by the computing speed and sensing range of the central node, find it increasingly difficult to meet practical needs. Distributed control schemes, which place low demands on each individual agent and scale well, have gradually become the mainstream of multi-agent control research. However, because a distributed control scheme has no central node to coordinate the behavior of all nodes, the system is vulnerable to faulty and malicious nodes, which in severe cases can paralyze the entire system. Therefore, for a distributed multi-agent system, designing a safe and efficient fault detection scheme that lets the system automatically detect and repair faulty nodes is an urgent task with broad application prospects.

For fault detection in distributed multi-agent systems, the main existing solutions are the following:

Scheme 1: The literature (I. Shames, A. M. H. Teixeira, H. Sandberg, and K. H. Johansson. Distributed fault detection for interconnected second order system [J]. Automatica, Oct. 2011, to appear.) and (S. Sundaram and C. N. Hadjicostis. Distributed function calculation via linear iterations in the presence of malicious agents, part I: Attacking the network [C]. American Control Conference, June 2008.) proposes using unknown input observers (UIO): through long-term observation, enough data is accumulated to estimate the initial state of the system, from which the final state is computed and used to judge whether the motion of the current node meets expectations. Real-time fault monitoring with observers is currently the mainstream approach to fault diagnosis in multi-agent systems. The fault signal acts as an unknown input to the system and drives the observer to produce an error output; the error signal is then used to diagnose and compensate for the fault.

Fault diagnosis with unknown input observers (UIO) has its own advantages: the physical meaning is clear and easy to understand, and the approach does not depend on a physical model, so it applies widely. In addition, the literature (Chung, W. H., Speyer, J. L., & Chen, R. H. A decentralized fault detection filter [J]. Journal of Dynamic Systems, Measurement, and Control, 123(2), 237–247, 2001) points out that, compared with other observers such as the Beard-Jones fault detection filter, the UIO has a relatively simple structure, and optimization algorithms are easily applied to obtain an approximately optimal solution. On the other hand, the scheme has shortcomings: a node with N adjacent nodes needs N+1 unknown input observers to detect faults in all of its adjacent nodes. When the system topology is complex or the number of nodes is large, the data and computation required become very large, which places high demands on node hardware. Running the scheme also occupies substantial computing resources and adversely affects the system's other control tasks.

Scheme 2: The literature (M. Franceschelli, M. Egerstedt, and A. Giua. Motion probes for fault detection and recovery in networked control systems [C]. American Control Conference, pages 4358–4363, June 2008.) proposes using motion probes: additional excitation signals are applied to the networked control system, the current operating state is judged from the system's response, and faulty nodes are detected in this way. Unlike Scheme 1, this scheme uses active detection. Its main problem is that it is difficult to carry out in practice: the choice of excitation signal, the time at which it is applied, and similar factors are subject to many constraints.

Scheme 3: The literature (Guo, M., Dimarogonas, D. V., & Johansson, K. H. (2012, June). Distributed real-time fault detection and isolation for cooperative multi-agent systems [C]. In American Control Conference (ACC), 2012 (pp. 5270-5275). IEEE.) proposes using the exchange of data between nodes: a node receives the control quantities of its adjacent nodes, simulates their motion from that information, and compares the simulated motion with the observed actual motion of those nodes as the basis for fault detection. The main advantages of this scheme are simple computation and ease of implementation, but it also has clear shortcomings: its restrictions are too strict, its scope of application is narrow, and its misoperation rate is too high.

The present invention is inspired by Scheme 3. It retains that scheme's strengths and, to address its shortcomings, proposes a distributed fault detection and compensation control scheme based on information interaction. The scheme improves the fault discrimination mechanism of the nodes and uses the Gossip algorithm to reduce the system's high misoperation rate. It also redefines the content exchanged between nodes so that nodes can use the received information more effectively. In addition, considering the complexity of information exchange protocols, the invention proposes a control-quantity-oriented fault repair scheme, which greatly relaxes the restrictions on the node information exchange protocol and widens the scheme's scope of application.

Summary of the invention

The present invention addresses the problem that current distributed multi-agent systems are prone to faults and lack a simple, feasible real-time fault handling scheme. It proposes a distributed real-time fault detection and compensation control method based on information interaction, in which faulty nodes are detected, isolated, and repaired through information exchange between nodes, so that system faults are handled promptly and the losses they cause are reduced.

The distributed real-time fault detection and compensation control method based on information interaction of the present invention comprises the following steps:

Step 1. System and fault modeling: the modeling includes a node dynamics model, an information interaction model, and typical fault models. The node dynamics model is a single integrator model, which describes the motion state of a node with a first-order differential equation. The information interaction model is described by an undirected graph, meaning that communication between nodes is bidirectional; the independent agents use it to exchange information and complete the system control task. The typical fault models cover the fault types that commonly occur in real agents.

Step 2. Real-time multi-agent fault detection based on information interaction: relevant state variables are selected from the expression of the node dynamics model of Step 1 to describe a node's operating state. A threshold function is set to partition operating states and distinguish normal nodes from faulty ones. Each node obtains the state information of its adjacent nodes through the information interaction model of Step 1 and applies a detection algorithm to decide whether they are faulty, forming a single-node detection result.

Step 3. Information integration and processing based on the Gossip algorithm: because of communication packet loss, time delay, and similar problems, the single-node detection results of Step 2 are strongly affected by the environment and are not very reliable. The Gossip algorithm is therefore used to integrate the single-node detection results into a more reliable combined detection result, which serves as the final basis for judging a node's operating state and distinguishing normal nodes from faulty ones.

Step 4. Control-quantity-oriented calculation and application of the compensation: if a faulty node is detected, it is isolated by the corresponding operations. At the same time, starting from the influence of the faulty node on the control quantities of its adjacent nodes, a calculation scheme is designed to obtain the compensation value, which is added to the original control quantity to offset the effect of the faulty node on the system.

Step 5. Connectivity maintenance based on two-hop information: starting from the information interaction model, the communication content between faulty nodes is analyzed, and the two-hop information it contains is used to establish virtual information transmission paths, ensuring that the fault handling scheme does not disturb the normal operation of the system.

The fault types include destructive faults, runaway faults, and interference faults.
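The information-integration idea of Step 3 can be illustrated with a generic randomized pairwise-averaging gossip. This is only a sketch under stated assumptions: the text above does not specify which Gossip variant the invention uses, and the function name `gossip_average`, the 0-based node indices, and the example edge list are hypothetical.

```python
import random


def gossip_average(values, edges, rounds, seed=0):
    """Generic pairwise-averaging gossip (an assumed variant, for illustration).

    In each round one undirected edge (i, j) is picked at random and both
    endpoints replace their value with the pair's average. The sum of all
    values is preserved, so on a connected graph every node drifts toward
    the global mean of the initial values.
    """
    rng = random.Random(seed)
    x = list(values)
    for _ in range(rounds):
        i, j = rng.choice(edges)
        x[i] = x[j] = (x[i] + x[j]) / 2.0
    return x


# Four nodes on a path graph; each starts with its own local 0/1 verdict.
local_verdicts = [1.0, 0.0, 0.0, 1.0]
path_edges = [(0, 1), (1, 2), (2, 3)]
fused = gossip_average(local_verdicts, path_edges, rounds=200)
```

After enough rounds each entry of `fused` approximates the network-wide average verdict, which is less sensitive to a single lost or delayed packet than any one local result.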

Compared with existing solutions, the main advantages and innovations of the present invention are the following:

(1) Most existing schemes place high demands on system hardware and occupy large amounts of computing resources (as in Schemes 1 and 2). The present invention starts from the basic control rules of the multi-agent system and makes full use of results the system already computes, so that real-time monitoring of adjacent nodes is achieved with very few additional computing resources, which greatly reduces the cost of applying the invention. In addition, at the cost of a small increase in communication content, the invention uses the Gossip algorithm to overcome the interference of random signals with fault detection results (the problem seen in Scheme 3). This innovation ensures the reliability of the fault detection results and gives the invention practical value.

(2) Existing schemes have done little research on isolating and repairing faulty nodes. Most simply terminate communication directly, and their fault repair schemes apply only to linear control protocols (as in Scheme 3), which limits their scope. The present invention starts from the system's control results and designs a fault isolation and compensation algorithm based on the control quantity. The algorithm takes into account saturation, the most common nonlinearity in such systems, and can repair faults under nonlinear control protocols, which greatly extends the invention's scope of application.

(3) None of the existing schemes has studied in depth how to maintain connectivity after a faulty node has been isolated. Using the existing communication content and the reliable detection information obtained with the Gossip algorithm, the present invention designs a scheme for maintaining the system topology based on two-hop information. As long as the faulty node has not left the communication range of the normal nodes, the adjacent-node information it relays can be used to establish virtual information transmission paths, so that the system stays connected at all times and its normal functions are not completely destroyed by the isolation of faulty nodes.

Description of the drawings

Figure 1: Topology of the multi-agent system;

Figure 2: Schematic diagram of the fault detection scheme;

Figure 3: Schematic diagram of the information processing scheme based on the Gossip algorithm;

Figure 4: Relationship between a node's expected output and actual output;

Figure 5: System topology after two-hop information is used;

Figure 6: Comparison of results for fault 1 with and without the handling scheme;

Figure 7: Comparison of results for fault 2 with and without the handling scheme;

Figure 8: Comparison of results for fault 3 with and without the handling scheme;

Figure 9: Flow chart of the distributed real-time fault detection and compensation control method based on information interaction.

Detailed description

The present invention is further described below with reference to the drawings and examples.

The system and detection models are given first:

In a practical multi-agent system, the motion information of the nodes is usually the controlled target, so that the motion states or position distribution of the nodes meet the control requirements. For a node using the single integrator model, the dynamics satisfy:

$\dot{x}_i(t) = u_i(t) \qquad (1)$

This equation shows that a node's control quantity equals the derivative of its state, which in general is the node's velocity. Equation (1) gives the node dynamics in continuous time. In a real control system, however, a node has to sample its state information and the sampling period cannot be infinitely small, so a discrete-time dynamics model is needed. From the theory of computer-controlled systems, discretizing the model above gives the typical discrete-time single integrator model:

$z_i((k+1)T) = z_i(kT) + u_i(kT)\,T, \quad i = 1, \dots, N \qquad (2)$

where T is the sampling time. For brevity, write $z_i^k = z_i(kT)$ and $u_i^k = u_i(kT)$, where $z_i^k$ is the position coordinate of node i in two-dimensional space and $u_i^k \in R^2$ is the control quantity of node i in time step k. The physical meaning of this model is that node positions are the control target: the position of each node is adjusted by controlling its velocity in every time step k, until the distribution of the nodes meets the control requirements. Whether information is transmitted over wired or wireless networks, a multi-agent system achieves cooperative control through the exchange of information between nodes. The whole information interaction network can be described by a graph $G = \{V, E\}$, where $V = \{1, \dots, N\}$ is the set of vertices of the graph, which also represent the nodes of the system, and E is the set of edges of the graph. We define: if node i can transmit its own information to node j, then i is called an adjacent node of j, that is, $(i, j) \in E$. Let $N_i^k = \{i_1, \dots, i_p\}$ denote the set of adjacent nodes of node i in time step k, and $|N_i^k|$ its cardinality. In addition, we assume that G is an undirected graph, that is, $(i, j) \in E \Leftrightarrow (j, i) \in E$.
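The discrete update of Eq. (2) and the undirected adjacency sets can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names `step` and `neighbor_sets`, the 0-based node indices, and the numeric values are assumptions made for the example.

```python
import numpy as np


def step(z, u, T):
    """Eq. (2) for all nodes at once: z^{k+1} = z^k + u^k * T.

    z and u are (N, 2) arrays holding planar positions and control inputs.
    """
    return z + u * T


def neighbor_sets(edges, n):
    """Adjacency sets of an undirected graph: (i, j) in E implies (j, i) in E."""
    nbrs = {i: set() for i in range(n)}
    for i, j in edges:
        nbrs[i].add(j)
        nbrs[j].add(i)
    return nbrs


# Three nodes (indexed from 0 here) with sampling time T = 0.1.
z = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
u = np.array([[1.0, 0.0], [0.0, -1.0], [0.5, 0.5]])
z_next = step(z, u, T=0.1)
nbrs = neighbor_sets([(0, 1), (1, 2)], n=3)
```

Storing each edge in both directions makes the symmetry of the undirected graph explicit, so `nbrs[i]` plays the role of $N_i^k$ for a fixed topology.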

In addition, the control rule of a node is assumed to have the following structure:

$u_i^k = P_i(z_i^k, I_i^k) \qquad (3)$

where $P_i: R^2 \to R^2$ is the control protocol, determined by the node's control objective, and $I_i^k$ is the state of the adjacent nodes of node i in time step k, with $N_i^k = \{i_1, \dots, i_p\}$ and $p = |N_i^k|$. The structure in Eq. (3) is common in multi-agent cooperative control: a node's control quantity is determined jointly by its own current state and the states of all its adjacent nodes. $P = \{P_1, \dots, P_N\}$ is the preset control protocol; if it satisfies […], P is called a homogeneous control protocol, otherwise a non-homogeneous one. The present invention considers only the case in which the information exchange protocol is homogeneous.
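One possible instance of the structure in Eq. (3) is a linear consensus rule, sketched below. The text above does not fix a particular protocol, so the function `consensus_protocol` and its `gain` parameter are hypothetical, and the sketch simply assumes that every node runs this same function.

```python
import numpy as np


def consensus_protocol(z_i, neighbor_states, gain=1.0):
    """Hypothetical P_i: steer node i toward the states of its adjacent nodes.

    z_i is the node's own position; neighbor_states is a list of the
    positions of its adjacent nodes (the role played by I_i^k in Eq. (3)).
    """
    if not neighbor_states:
        return np.zeros_like(z_i)
    return gain * sum(z_j - z_i for z_j in neighbor_states)


z_own = np.array([0.0, 0.0])
states = [np.array([1.0, 0.0]), np.array([0.0, 2.0])]
u_own = consensus_protocol(z_own, states)
```

The point of the sketch is only the signature: the control quantity depends on the node's own state and on the states received from adjacent nodes, which is what lets a neighbor that knows both recompute it.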

Let F denote the set of all faulty nodes in the system, and let $q_{i,j}^k$ denote the detection result of node i for node j in time step k, which satisfies:

$q_{i,j}^k = \begin{cases} 0, & j \notin F \\ 1, & j \in F \end{cases} \qquad (4)$

In every time step k, a node performs fault detection on all of its adjacent nodes and obtains the detection results $q_{i,j}^k$. In addition, the system's detection result for a node is defined as the combination of the detection results that all of the node's adjacent nodes hold about it, in the following form:

$Q_i^k = \sum_{j = i_1}^{i_p} q_{j,i}^k \,/\, p \qquad (5)$

where $N_i^k = \{i_1, \dots, i_p\}$ and $p = |N_i^k|$. Through the exchange of data, every node obtains the system's evaluation of itself; how it is obtained is described in detail below.
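Equation (5) is simply the mean of the binary verdicts that the adjacent nodes hold about node i. A minimal sketch, assuming the verdicts are stored in a dictionary keyed by `(detector, target)` pairs (a hypothetical layout chosen for the example):

```python
def system_result(q, i, neighbors):
    """Eq. (5): Q_i = (sum over adjacent nodes j of q[(j, i)]) / p.

    q maps (detector, target) pairs to 0 (normal) or 1 (faulty), following
    Eq. (4); neighbors is the adjacent-node set of node i and p its size.
    """
    p = len(neighbors)
    if p == 0:
        raise ValueError("node has no adjacent nodes, Q_i is undefined")
    return sum(q[(j, i)] for j in neighbors) / p


# Node 0 has three adjacent nodes; two of them flag it as faulty.
q = {(1, 0): 1, (2, 0): 0, (3, 0): 1}
Q_0 = system_result(q, i=0, neighbors={1, 2, 3})
```

The result lies in [0, 1] and can be read as the fraction of adjacent nodes that currently consider node i faulty.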

The information interaction model between nodes is analyzed next:

In a multi-agent system, each node controls itself by sensing its surroundings. If a node's perception of its environment is based on the exchange of information between nodes, the model is called an information-interaction-based model. In the system information interaction model discussed in the present invention, the content exchanged between nodes consists of the following parts:

Content 1: In time step k, node $i \in V$ transmits the control quantity $u_i^k$ obtained from Eq. (3) and its own current state $z_i^k$ to all of its adjacent nodes $j \in N_i^k$.

Content 2: In time step k, node $i \in V$ transmits the states of its adjacent nodes, together with the system's detection results for those adjacent nodes obtained from Eq. (5), $\{Q_j^k \mid j \in N_i^k\}$, to all of its adjacent nodes $j \in N_i^k$.

Content 3: In time step k, node $i \in V$ transmits its detection results for the corresponding adjacent nodes, $q_{i,j}^k$, and the adjacent nodes' detection results for i, $q_{j,i}^k$, to all of its adjacent nodes $j \in N_i^k$.

It is worth noting that the adjacent nodes' detection results for i transmitted in Content 3 are not sent in the form $Q_i^k$, although the two are equivalent in meaning. The main concern is the case in which node i is malicious: if $Q_i^k$ were transmitted directly, a malicious node could deliberately modify it so that the system could not detect that node. When the individual results $q_{j,i}^k$ are sent instead, the data includes each node's own detection information about the malicious node, so it can be used for cross-checking, or the detection result can be confirmed by comparing data with the adjacent nodes of the malicious node. This belongs to the field of information countermeasures and is not discussed in detail in the present invention.
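The three kinds of content can be pictured as one message record per node and time step. The sketch below is an assumed layout, not a format given in the text: the class `NodeMessage`, its field names, and the helper `recompute_system_result` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

Vec = Tuple[float, float]


@dataclass
class NodeMessage:
    sender: int
    control: Vec                               # Content 1: u_i^k
    state: Vec                                 # Content 1: z_i^k
    neighbor_states: Dict[int, Vec]            # Content 2: states of adjacent nodes
    neighbor_system_results: Dict[int, float]  # Content 2: Q_j^k for adjacent j
    verdicts_on_neighbors: Dict[int, int]      # Content 3: q_{i,j}^k
    verdicts_from_neighbors: Dict[int, int]    # Content 3: q_{j,i}^k


def recompute_system_result(msg):
    """Rebuild Q_i^k from the individual verdicts q_{j,i}^k carried in the message."""
    votes = msg.verdicts_from_neighbors
    if not votes:
        raise ValueError("message carries no verdicts about the sender")
    return sum(votes.values()) / len(votes)


msg = NodeMessage(
    sender=1,
    control=(0.2, 0.0),
    state=(1.0, 1.0),
    neighbor_states={2: (0.0, 1.0), 3: (2.0, 1.0)},
    neighbor_system_results={2: 0.0, 3: 0.5},
    verdicts_on_neighbors={2: 0, 3: 1},
    verdicts_from_neighbors={2: 0, 3: 1},
)
Q_sender = recompute_system_result(msg)
```

Because the receiver rebuilds the aggregate from individual verdicts, each entry can in principle be compared with what the corresponding adjacent node reports itself, which is the cross-checking possibility mentioned above.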

Typical fault types are given below:

Because different systems differ in composition and mode of operation, they also define faults differently. For a system whose functions are relatively independent and whose structure is fully known, faults can be defined by their causes. For a car, for example, faults such as engine damage or brake failure can be defined in terms of the power system, the braking system, and so on. This approach is well targeted, minimizes damage to the system, and keeps its functions unaffected; most real-world systems define faults this way. For a multi-agent system, however, the modes of operation are complex and varied and the topologies differ, so the specific cause of a fault is often hard to determine, and faults cannot be defined from their source. Since a multi-agent system completes its control task through the cooperation of many agents, and a single agent has no decisive effect on the system, faults can instead be defined from a node's actual operating result without analyzing their causes: whenever the operating result of a node fails to meet the system requirements, the node is deemed faulty and removed from the system. This way of defining faults does some damage to the system, but the loss is still small compared with stopping the entire system to repair a single node. It also greatly simplifies fault detection and allows faults to be handled in real time, so that the system can still complete its expected task as far as possible while faults are present. Based on practical multi-agent systems, the following typical fault forms are given:

Fault 1: Destructive fault. The node stops moving abnormally during operation, or shows a tendency to move while its actual state does not change as expected. Possible causes are an external attack that destroys the drive system, or exhaustion of the node's energy so that it loses its power source. A special case is also included: the node cannot continue moving because of its environment or its own program, for example when it is stuck on impassable terrain, or when a defect in the control program traps it at a local extremum. Although the node itself is not damaged in this case, it can no longer move normally, so it is still classified as a destructive fault.

Fault 2: Runaway (out-of-control) fault. The node's motion is not under control: its velocity stays constant or changes abnormally, so the control result cannot meet the system requirements. Possible causes are an error in the control system that prevents it from generating control information correctly, or a loss of connection between the node's drive system and its control system, so that the actuator does not receive the correct control quantity. This fault is also the one most likely to appear when the system is under malicious attack, so it can serve as one indicator of malicious nodes in the system. When it is detected, preventive measures should be taken promptly to stop malicious information from spreading further.

Fault 3: Interference fault. The node shows a large amount of irregular motion, and the deviation between its actual and theoretical operating states is large enough to endanger the normal operation of the system. This fault has many causes and is the most common in practical multi-agent systems. The node may be subject to strong external random disturbances, such as very rough terrain, or its actuators may lack precision and produce large random errors. Such faults should be handled with caution, because some error is unavoidable: if the detection procedure is too strict, a large number of nodes may be judged faulty, causing unnecessary losses to the system. Filtering algorithms and similar methods can be used to compensate for this kind of problem.

In addition, in an information-interaction-based model, a node's perception of its surroundings and the computation of its own control quantity depend entirely on data exchange with its adjacent nodes. Data exchange is therefore the bridge between the node and the system and is vital to the node's normal operation. For the three fault types above, if only the node's controller or drive system has failed and the node retains normal information exchange, the fault is called a Class I fault; if the node's information exchange system is damaged and data can no longer be exchanged normally, it is called a Class II fault.
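For simulation purposes, the three fault types can be mimicked by altering the velocity a node actually executes. The sketch below is one assumed way to do so and is not taken from the text: the function `actual_velocity`, its parameters, and the Gaussian disturbance are hypothetical modelling choices.

```python
import numpy as np


def actual_velocity(u_cmd, fault=None, u_stuck=None, noise_std=1.0, rng=None):
    """Velocity a node really executes, given its commanded control u_cmd.

    fault=None           -> healthy node, follows the command
    fault="destructive"  -> node does not move at all
    fault="runaway"      -> node keeps the velocity u_stuck, ignoring the command
    fault="interference" -> command corrupted by an assumed Gaussian disturbance
    """
    if fault is None:
        return u_cmd.copy()
    if fault == "destructive":
        return np.zeros_like(u_cmd)
    if fault == "runaway":
        return np.asarray(u_stuck, dtype=float)
    if fault == "interference":
        rng = rng if rng is not None else np.random.default_rng()
        return u_cmd + rng.normal(0.0, noise_std, size=u_cmd.shape)
    raise ValueError(f"unknown fault type: {fault}")


u_cmd = np.array([1.0, 0.0])
v_destroyed = actual_velocity(u_cmd, "destructive")
v_runaway = actual_velocity(u_cmd, "runaway", u_stuck=[0.3, 0.3])
v_noisy = actual_velocity(u_cmd, "interference", rng=np.random.default_rng(0))
```

In this picture a Class I fault leaves the node's messages intact while its executed velocity is wrong, whereas a Class II fault would additionally corrupt or suppress the messages themselves.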

The specific implementation of the fault detection scheme based on information interaction is given below:

As the discussion above shows, the present invention defines faults in terms of a node's control performance: a node is considered faulty as soon as its actual operating state fails to meet the control requirements. This suggests a natural fault detection scheme: monitor the operating state of the system, and if the error r between a node's theoretical and actual operating states exceeds a certain range, conclude that the node is faulty.

Because the node model considered in the present invention is a single-integrator model, the output appears in the node's successive displacement $z_i^{k+1} - z_i^k$, or equivalently in its velocity output $u_i^k$. We therefore use $u_i^k$ as the performance index for computing the residual signal: $r_i^k = u_i^{r,k} - u_i^{a,k}$, where $u_i^{r,k} \in R^2$ is the theoretical motion state obtained from the control protocol P within time step k, and $u_i^{a,k} \in R^2$ is the actual motion state obtained by real-time measurement. They satisfy:

$u_i^{r,k} = u_i^k = P_i(z_i^k, I_i^k) \qquad (6)$

$u_i^{a,k} = h(z_i^{k+1}, z_i^k) \qquad (7)$

If the state of node i is continuous, $z_i^{k+1}$ and $z_i^k$ can be measured by the node's built-in sensors, and h can take the simple first-order difference form $(z_i^{k+1} - z_i^k)/[(k+1)T - kT]$, which reduces to $(z_i^{k+1} - z_i^k)/T$.

Faulty nodes are now defined as follows:

Definition 1: A node i using the single-integrator model is called a faulty node if it satisfies the condition in Eq. (8).

$\|r_i^k\| = \|u_i^{r,k} - u_i^{a,k}\| > \chi(\|u_i^{r,k}\|, \delta) \qquad (8)$

Here $\chi(\|u_i^{r,k}\|, \delta)$ is called the threshold function; its value depends on the magnitude of the input signal $\|u_i^{r,k}\|$ and on the disturbance $\delta$. A common choice is $\chi(\|u_i^{r,k}\|, \delta) = \gamma_1 + \gamma_2 \|u_i^{r,k}\|$, where the constant $\gamma_1$ depends on the disturbance $\delta$ and the time-varying term $\gamma_2 \|u_i^{r,k}\|$ depends on the node's instantaneous input.
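The residual test of Eqs. (6) to (8) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the tuple representation of planar states, and the numeric values of `gamma1` and `gamma2` are hypothetical and do not come from the patent; the theoretical control `u_ref` is assumed to have been computed elsewhere from the control protocol P.

```python
import math

def norm(v):
    """Euclidean norm of a planar vector given as a tuple."""
    return math.sqrt(sum(x * x for x in v))

def actual_velocity(z_next, z_curr, T):
    """Eq. (7): difference quotient (z^{k+1} - z^k) / T."""
    return tuple((a - b) / T for a, b in zip(z_next, z_curr))

def threshold(u_ref, gamma1, gamma2):
    """Threshold function chi = gamma1 + gamma2 * ||u_ref||."""
    return gamma1 + gamma2 * norm(u_ref)

def is_faulty(u_ref, z_curr, z_next, T, gamma1, gamma2):
    """Eq. (8): flag the node when ||u_ref - u_act|| exceeds chi."""
    u_act = actual_velocity(z_next, z_curr, T)
    residual = tuple(r - a for r, a in zip(u_ref, u_act))
    return norm(residual) > threshold(u_ref, gamma1, gamma2)

# A node commanded to move at (1, 0) that actually does so is not flagged;
# a node that stays in place (e.g. a stuck actuator) is flagged.
moving = is_faulty((1.0, 0.0), (0.0, 0.0), (1.0, 0.0), 1.0, 0.05, 0.1)
stuck = is_faulty((1.0, 0.0), (0.0, 0.0), (0.0, 0.0), 1.0, 0.05, 0.1)
```

Because the threshold grows with $\|u_i^{r,k}\|$, a node commanded to move fast is allowed a proportionally larger residual than a node that is nearly at rest.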

For a system that may contain faulty nodes, the control objective is that the system completes its original task while detecting and isolating the faulty nodes. Because faulty nodes cannot take part in the original task, we stipulate that the system as a whole has achieved its goal if all non-faulty nodes have completed the original task.

As shown in Figure 2, the present invention proposes the following fault detection scheme:

Suppose node j is performing fault detection on node i. Within time step k, through information interaction contents 1 and 2 described above, node j obtains the current state $z_i^k$ of node i and the information $I_i^k$ of all of i's adjacent nodes. Because the information interaction protocol is homogeneous, node j can use its own control protocol together with $I_i^k$ to compute the theoretical control quantity $u_i^{r,k}$ of node i. In the next time step k+1, node j similarly obtains $z_i^{k+1}$ and $u_i^{r,k+1}$, and uses Eq. (7) to compute $u_i^{a,k}$. Node j can then apply Eq. (8) to judge whether node i was faulty during time step k.

Intuitively, the scheme obtains the information of the target node's adjacent nodes, uses the homogeneous information interaction protocol to compute the target node's theoretical motion state, and compares it with the observed actual motion state. If the error exceeds a certain magnitude, the node is judged to be faulty.

The scheme above yields the detection result of a single node, which is strongly affected by random factors and is therefore not very reliable. For example, information exchange between nodes is often subject to delay and data loss. If a node does not receive an adjacent node's information in time, or receives it incompletely, it may well misjudge that neighbor as faulty and take actions such as isolating it. Those actions will in turn be regarded as faults by its own adjacent nodes, so that the node itself is detected as faulty. If this continues, a large number of normal nodes will be isolated through such misoperations, wasting resources and possibly making the global objective unattainable. To address this, the present invention uses a Gossip algorithm to process the detection results of the individual nodes and thereby improve their accuracy. The scheme is illustrated in Figure 3 and implemented as follows:

Take node i as an example. First, within time step k, node i performs fault diagnosis on its own, using Eqs. (8) and (4) to obtain its diagnosis results for all of its adjacent nodes; all adjacent nodes of i do the same at the same time. Next, as described in information interaction content 3, node i sends each adjacent node its diagnosis result for that node and receives the adjacent nodes' diagnosis results for i. Finally, node i sends the combined detection result that its adjacent nodes have formed about it to all of its adjacent nodes and receives their combined detection results in return. Using Eq. (5), node i can then compute the system-level detection result for each adjacent node j, and by comparing this result against a parameter $Q_{con}$ it can judge whether node j is faulty. In general, $Q_{con}$ is a constant in the interval (0, 1], and its value is influenced by the precision of node execution, the strength of environmental disturbances, and the quality of information exchange between nodes. The larger $Q_{con}$ is, the more reliable a positive detection becomes, but the higher the probability of missed detections. The value of $Q_{con}$ should therefore be chosen to suit the actual system.
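The role of $Q_{con}$ can be illustrated with a small Python sketch. Eqs. (4) and (5) are not reproduced in this part of the text, so the aggregation rule below is an assumption made purely for illustration: each neighbor's verdict is taken to be binary (1 for "faulty", 0 for "normal"), the system-level result is taken to be their mean, and a node is declared faulty when that mean reaches `q_con`. The function names and the choice of `>=` are likewise hypothetical.

```python
def aggregate_verdicts(verdicts):
    """Assumed stand-in for Eq. (5): mean of binary verdicts about one node."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    return sum(verdicts) / len(verdicts)

def system_says_faulty(verdicts, q_con):
    """Declare the node faulty when the aggregated score reaches q_con."""
    if not 0.0 < q_con <= 1.0:
        raise ValueError("q_con must lie in (0, 1]")
    return aggregate_verdicts(verdicts) >= q_con

# Three of four neighbors report a fault: accepted at q_con = 0.75.
broad_agreement = system_says_faulty([1, 1, 0, 1], 0.75)
# A single false alarm (e.g. caused by a lost packet) is rejected.
single_alarm = system_says_faulty([1, 0, 0, 0], 0.75)
```

Under this assumed rule, raising `q_con` demands broader agreement before a node is isolated, which matches the trade-off described above: fewer false alarms, but a higher chance of missing a real fault.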

The specific implementation of the fault isolation and repair scheme is given below:

Once fault detection is complete, the system usually needs to isolate the faulty node in order to remove its influence on the remaining normal nodes. In a distributed multi-agent system, each node carries out fault detection on its own, so a faulty node may well be detected by its adjacent nodes at different times. Applying the Gossip algorithm to the detection results also introduces some delay. The time at which a node fails and the time at which the system diagnoses it as faulty therefore generally differ. During this interval the faulty node continues to act on the system and biases the final control result. To remove this effect, this section proposes a fault repair algorithm that compensates the system's control quantity by applying an external signal.

As discussed above, the nodes of the system may detect a faulty node at different times. If every node isolated and repaired the fault at the moment it detected it, that isolation and repair action could itself be diagnosed as a fault by its adjacent nodes, which would then take the same action against it. This would spread step by step and eventually bring down the whole system. It is therefore necessary to prescribe a common time at which all nodes act on faults. We introduce a new parameter, the fault detection and repair period, denoted $T_p = p^* T$, where the constant $p^* \in Z^+$ and T is the sampling time. In each period $T_p$, nodes perform fault detection and information processing during $k \in [k^* T_p + T, (k^*+1) T_p - T]$ and isolate and repair faulty nodes at $k = (k^*+1) T_p$, $k \in Z^+$. Note that fault isolation and repair is a non-routine operation that is likely to be detected as a fault by adjacent nodes, so the fault detection function of every node should be temporarily disabled at $k = (k^*+1) T_p$, $k \in Z^+$.
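The period structure can be sketched as a small scheduling function. The sketch assumes that time is counted in sample indices (so T corresponds to one step and $T_p$ to `p` steps); the function name `phase`, the string labels, and the treatment of step 0 as "idle" are hypothetical choices made for illustration.

```python
def phase(k, p):
    """Classify sample index k within a detection/repair period of p steps.

    Steps k*p + 1 .. (k+1)*p - 1 are used for detection and gossip;
    every positive multiple of p is reserved for isolation and repair,
    during which fault detection is disabled.
    """
    if p < 2:
        raise ValueError("period must contain at least one detection step")
    if k < 0:
        raise ValueError("k must be non-negative")
    if k % p == 0:
        return "repair" if k > 0 else "idle"
    return "detect"

# With p = 5, steps 1..4 are detection steps and step 5 is a repair step.
schedule = [phase(k, 5) for k in range(0, 11)]
```

Because every node evaluates the same function of the shared step counter, all nodes isolate and repair at the same instants, so a repair action is never observed by a neighbor that is still in detection mode.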

The isolation scheme for faulty nodes is given below:

Fault isolation means removing the faulty node's contribution from the control quantities of its adjacent nodes and cutting off the channel through which the faulty node receives its adjacent nodes' information, so as to eliminate the faulty node's influence. The obvious approach is that when a node detects that an adjacent node is faulty, it removes that node from its set of adjacent nodes and stops sending it its own state information. Note that the node does not stop sending the faulty node the information of its own adjacent nodes: the two are not directly related, and sending this information has little effect on the system. Sending of the node's own information is stopped mainly for security reasons. The cause of the fault is unknown, and if the faulty node has been taken over by an adversary, data sent to it could be used maliciously and affect the sender's own control.

Stopping transmission also creates a problem, however. A faulty node that no longer receives its adjacent nodes' information will judge them to be faulty and stop sending to them in turn. Once the faulty node has stopped sending to all of its adjacent nodes, it becomes completely invisible to the system, and the harm caused by its motion can no longer be avoided. This can lead to many adverse situations, such as collisions between nodes or catastrophic damage to the system topology. To avoid this, the following operation is defined for nodes:

Operation 1: When a node cannot receive the information of one of its adjacent nodes, it uses the evaluations of itself that it has already received from its adjacent nodes, together with Eq. (5), to compute the system's diagnosis of itself. If that result exceeds the magnitude $Q_{con}$, the node concludes that it is itself faulty. It then disables its own fault detection function but continues to send data to its adjacent nodes; the data simply no longer include the part listed under content 3.

With this operation, a faulty node no longer evaluates the remaining normal nodes, but its motion remains visible to the system, so that the system can react early to any destructive behavior. In addition, the retained information interaction contents 1 and 2 turn the faulty node into a relay node for information transfer, which prevents the isolation operation from catastrophically damaging the system topology. This is discussed in detail below.

The fault repair scheme is given below:

The purpose of fault repair is as follows: if a faulty node was not isolated promptly after failing and has therefore already affected the system, fault repair is used to remove that effect. An intuitive idea is to separate out the faulty node's contribution to the control quantity, negate it, and add it back to the original control quantity so as to cancel the faulty node's influence. This, however, requires the node's control protocol to be linear and superposable, so that the faulty node's contribution can be separated. In complex multi-agent systems the control protocol is often nonlinear and not superposable. The present invention therefore starts from the node's actual control effect, defines a new scheme for computing the compensation and repairing the fault, and demonstrates its feasibility.

For any node i in the multi-agent system, if the node is not faulty, its expected output $u_i$ and its actual output $y_i$ must satisfy the relationship shown in Figure 4. The execution error of the actuator itself is neglected here.

In the figure, $u_{imax}$ denotes the maximum expected output of node i and $y_{imax}$ denotes the upper limit of the node's actual output. For a real node, when the expected output exceeds the output the node can actually achieve, the node can only run at its maximum output $y_{imax}$, so part of the control quantity cannot appear in the output. To keep nodes from diagnosing this saturation behavior as a fault, the control quantity must be limited in magnitude. With this limit in place, the expected and actual outputs satisfy the following relationship:

$y_i = a \cdot u_i \qquad (9)$

where the constant $a \in R^+$ is the output gain of the system; the present invention assumes $a = 1$.

It should be noted that $u_i$ here is only a mathematical description of the node's expected output, not the controller's real output. Real actuators exhibit many kinds of nonlinearity beyond simple saturation, and the controller must use suitable control algorithms, such as PID control or fuzzy control, to ensure that the node carries out its output task properly. Furthermore, the actual node output $y_i$ mentioned above refers to the node's idealized steady-state response; its dynamic response is outside the scope of the present invention.

The following operation is defined for computing the compensation:

Operation 2: When node i detects at time $k = T_i$ that its adjacent node j is faulty, it uses Eq. (3) to compute its own control quantity without the influence of j, $u_{i \setminus j}^k$. At the same time it computes the value of $u_i^k - u_{i \setminus j}^k$, negates it, and accumulates it until the next fault isolation and repair time $k = T_{ip}$.

From Operation 2, the compensation that node i must apply to remove the influence of faulty node j is:

$u_{i_{comp}, j} = -\sum_{k = T_i}^{T_{ip}} \left( u_i^k - u_{i \setminus j}^k \right) \qquad (10)$
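The accumulation in Eq. (10) can be sketched as follows. Eq. (3) is not reproduced in this part of the text, so the `control` function below is an assumed stand-in (a plain linear consensus law on scalar states), chosen only to make the example runnable; the names `control` and `compensation` and the data layout are hypothetical. The compensation routine itself only needs to evaluate the protocol twice per step, with and without node j, so it does not rely on the protocol being linear.

```python
def control(z_i, neighbor_states):
    """Assumed stand-in for Eq. (3): u_i = sum over neighbors of (z_l - z_i)."""
    return sum(z_l - z_i for z_l in neighbor_states.values())

def compensation(z_i_hist, neighbor_hist, faulty):
    """Eq. (10): negated sum of (u_i^k - u_{i\\j}^k) over the recorded steps."""
    total = 0.0
    for z_i, nbrs in zip(z_i_hist, neighbor_hist):
        u_with = control(z_i, nbrs)
        healthy = {l: z for l, z in nbrs.items() if l != faulty}
        u_without = control(z_i, healthy)
        total -= (u_with - u_without)
    return total

# Node i stays at 0; healthy neighbor "a" sits at 1, faulty neighbor "j" drifts away.
z_i_hist = [0.0, 0.0]
neighbor_hist = [{"a": 1.0, "j": 4.0}, {"a": 1.0, "j": 6.0}]
u_comp = compensation(z_i_hist, neighbor_hist, "j")
```

In this toy run the faulty neighbor pulled node i by 4 and then 6 units of control, so the accumulated compensation is their negated sum.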

Because a real system has a maximum output magnitude, the compensation obtained from Eq. (10) cannot simply be added to the original control quantity. One must consider whether adding it would push the control quantity beyond the limit, in which case the compensation would be incomplete. The following operation is therefore defined:

Operation 3: At $k = (k^*+1) T_p$, $k \in Z^+$, if node i has confirmed that node j is faulty, it adds the compensation obtained from Eq. (10) to the original control quantity and checks whether the resulting control quantity exceeds the limit. If it does, the part exceeding the limit is reassigned to the compensation, and compensation continues at the next isolation and repair time. If it does not, the compensation is reset to zero and the repair is complete.

From Operation 3, after a number of fault detection and repair periods the compensation will have been added to the control quantity in full, at which point the repair of the faulty node's influence is complete.
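Operation 3 amounts to saturating the compensated control and carrying the clipped remainder forward. A minimal scalar sketch, assuming a symmetric limit `u_max` (the function name and the sample values are hypothetical):

```python
def apply_compensation(u, comp, u_max):
    """Add comp to u, clip to [-u_max, u_max], and return
    (applied control, compensation left over for the next repair instant)."""
    desired = u + comp
    applied = max(-u_max, min(u_max, desired))
    leftover = desired - applied
    return applied, leftover

# Three consecutive repair instants with limit 1.0 and initial compensation 2.0.
u1, comp1 = apply_compensation(0.5, 2.0, 1.0)
u2, comp2 = apply_compensation(0.25, comp1, 1.0)
u3, comp3 = apply_compensation(0.0, comp2, 1.0)
```

In this run the compensation shrinks from 2.0 to 1.5 to 0.75 and reaches zero at the third repair instant, which is the behavior described above: the compensation is delivered in full over several periods without ever commanding more than the node can output.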

The network connectivity maintenance scheme based on two-hop information is given below:

From the analysis above, for a node with a Type II fault, whose information interaction system has been damaged, the effect on network connectivity cannot be repaired within the existing topology. If the node still retains normal information interaction (a Type I fault), however, it can be treated as a relay node for information transfer, and two-hop information paths can be established through it to repair a network topology that would otherwise be catastrophically damaged. The implementation is as follows:

As shown in Figure 1, suppose node 3 fails. If it is merely isolated and repaired, there will be no information exchange between nodes 1 and 2 on one side and nodes 4, 5, 6, and 7 on the other; the connectivity of the original graph is destroyed, and the system can no longer cooperate to achieve the control objective. Recalling what Operation 1 prescribes for a faulty node, the adjacent nodes of the faulty node can still receive information interaction contents 1 and 2 from it, and content 2 contains the complete information of its adjacent nodes. The faulty node can therefore be used as a relay, and a virtual information path can be established between any two of its adjacent nodes that are not adjacent to each other, thereby preserving the connectivity of the original graph. The following operation is defined:

Operation 4: If node i detects that its adjacent node j is faulty, then after completing the fault isolation operation it examines the information about j's adjacent nodes carried in node j's information interaction content 2. If a node l is an adjacent node of j and is not already an adjacent node of i, then set $z_l^k \in I_i^k,\; Q_l^k \in \{ m \in N_i^k \mid Q_m^k \}$.

After Operation 4, Figure 1 becomes the topology shown in Figure 5, in which node 3 is faulty. The dashed lines represent the two-hop information paths that use node 3 as a relay.

From Operation 4, if two adjacent nodes of the faulty node have no direct information link between them, the operation establishes a virtual information channel between them, making the two nodes adjacent in a logical sense. This ensures that the connectivity of the original graph is not destroyed.
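The effect of Operation 4 on the communication graph can be sketched as follows. The graph used here is a small hypothetical example, not the topology of Figure 1, and the function name and adjacency-dictionary representation are assumptions made for illustration.

```python
def add_virtual_links(adjacency, faulty):
    """Isolate `faulty` and join its mutually non-adjacent neighbors.

    Returns (new adjacency dict, set of virtual links as sorted pairs).
    The faulty node is removed from its neighbors' sets but is assumed
    to keep relaying contents 1 and 2, which is what makes the virtual
    links possible.
    """
    new_adj = {n: set(nbrs) for n, nbrs in adjacency.items()}
    nbrs = sorted(adjacency[faulty])
    virtual = set()
    for a in nbrs:
        for b in nbrs:
            if a < b and b not in adjacency[a]:
                new_adj[a].add(b)
                new_adj[b].add(a)
                virtual.add((a, b))
    for n in nbrs:
        new_adj[n].discard(faulty)
    new_adj[faulty] = set()
    return new_adj, virtual

# Hypothetical undirected graph in which node 3 is the only bridge to nodes 4 and 5.
graph = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5}, 5: {4}}
repaired, links = add_virtual_links(graph, 3)
```

Here nodes 1 and 2 were already adjacent, so only the pairs (1, 4) and (2, 4) receive virtual links; the healthy nodes remain connected even though node 3 no longer takes part in the control law.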

The feasibility of the proposed fault detection, isolation, and repair scheme is now verified on the consensus problem in multi-agent control.

First assume that the weights $a_{ij}$ are equal for all links. The control protocol P is then homogeneous, that is, all nodes generate their control quantities in exactly the same way. Using information interaction contents 1 to 3, a node receives its adjacent nodes' information, diagnoses them with its own control protocol, and then processes the results with the Gossip algorithm, which completes fault detection. Operations 1 to 4 then carry out fault isolation and repair and maintain network connectivity. Nothing in this process imposes special conditions that restrict the use of the fault handling scheme, so if the link weights in the multi-agent network are equal, the scheme can be applied to the consensus problem.

In a real multi-agent system, however, different links in the information interaction network may have different weights, in which case the control protocol P is no longer homogeneous. Information interaction content 2 then needs to be modified: node i no longer transmits the state information $I_i^k$ of its adjacent nodes but transmits a new message instead, where $\{i_1, \ldots, i_p\} = N_i^k$. The control protocol between nodes then no longer contains the non-homogeneous terms $a_{ij}$ and can still be regarded as homogeneous, so by the analysis above the original scheme remains applicable.

The software simulation results are given below:

Figures 6 to 8 show the results of a MATLAB simulation of consensus control with 8 agents. Figure 6 (left), Figure 7 (left), and Figure 8 (left) show the results after fault detection, isolation, and repair for Fault 1, Fault 2, and Fault 3 respectively, while Figure 6 (right), Figure 7 (right), and Figure 8 (right) show the corresponding results when no fault handling is applied. The figures show that if the fault is left unhandled, the remaining nodes are gradually pulled away from the intended target by the faulty node, so the control objective of the whole system cannot be achieved. With the fault handling scheme in place, the faulty node no longer affects the remaining nodes, and the remaining normal nodes still complete consensus control as expected.

The above describes only preferred embodiments of the present invention, and the present invention is not limited to these embodiments. Any partial modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A distributed real-time fault detection and compensation control method based on information interaction, characterized by comprising the following steps:

Step 1. System and fault modeling: the modeling includes a node dynamics model, an information interaction model, and a typical fault model; the node dynamics model adopts a single-integrator model, in which the motion state of a node is described by a first-order differential equation; the information interaction model is described by an undirected graph, that is, nodes can communicate in both directions, and the independent agents use this to exchange information and complete the system's control tasks; the typical fault model covers the fault types that commonly occur in real agents;

Step 2. Multi-agent real-time fault detection based on information interaction: relevant state variables are selected from the expression of the node dynamics model described in Step 1 to describe a node's operating state; a threshold function is set to partition the node's operating state and distinguish normal nodes from faulty nodes; at the same time, a single node obtains the state information of its adjacent nodes by means of the information interaction model described in Step 1 and uses a detection algorithm to determine whether they are faulty, forming a single-node detection result;

Step 3. Information integration and processing based on the Gossip algorithm: because of problems such as communication packet loss and time delay, the single-node detection results of Step 2 are strongly affected by the environment and are not very reliable; the Gossip algorithm is therefore used to integrate the single-node detection results into a combined detection result of higher reliability, which serves as the final basis for judging a node's operating state and distinguishing normal nodes from faulty nodes;

Step 4. Calculation and application of compensation for the control quantity: if a faulty node is detected, it is isolated through the corresponding operations; at the same time, based on the faulty node's influence on the control quantities of its adjacent nodes, a calculation scheme is designed to obtain the value of the compensation, which is added to the original control quantity to offset the faulty node's impact on the system;

Step 5. Design connectivity maintenance based on two-hop information: starting from the information interaction model, analyze the communication content between faulty nodes, and establish a virtual information transmission path by using the two-hop information to ensure that the fault handling scheme will not affect the normal operation of the system.

2. A distributed real-time fault detection and compensation control method based on information interaction according to claim 1, wherein said fault types include destructive faults, out-of-control faults and interference faults.