CN113283516A

CN113283516A - Multi-sensor data fusion method based on reinforcement learning and D-S evidence theory

Info

Publication number: CN113283516A
Application number: CN202110605802.5A
Authority: CN
Inventors: 蒋雯; 黄方慧; 耿杰; 邓鑫洋
Original assignee: Northwestern Polytechnical University
Current assignee: Northwestern Polytechnical University
Priority date: 2021-06-01
Filing date: 2021-06-01
Publication date: 2021-08-20
Anticipated expiration: 2041-06-01
Also published as: CN113283516B

Abstract

The invention discloses a multi-sensor data fusion method based on reinforcement learning and D-S evidence theory, comprising the following steps: step one, inputting multi-sensor data; step two, constructing a Markov decision process model; step three, resolving conflicting data with the Q-learning algorithm; and step four, fusing the multi-sensor data using D-S evidence theory. Combining reinforcement learning with D-S evidence theory allows multi-sensor data fusion to be performed quickly and efficiently. Reinforcement learning processes conflicting data and faulty data from the sensors online in real time, yielding valid multi-sensor data after conflict resolution and thereby addressing the problem of highly conflicting data. The D-S evidence theory fusion method then fuses the uncertain multi-source data well and improves fusion performance.

Description

A Multi-sensor Data Fusion Method Based on Reinforcement Learning and D-S Evidence Theory

Technical Field

The invention relates to the field of multi-sensor data fusion. It is a method that fuses multi-sensor data based on reinforcement learning and D-S evidence theory, achieving effective real-time online fusion of multi-sensor data.

Background

As modern science and technology are applied to industrial equipment, the equipment becomes more complex, and the information from a single sensor can no longer accurately reflect its complex operating conditions. Because a single sensor is also easily disturbed by the environment, the acquired data may contain faulty data, which prevents accurate decision-making for complex equipment. Multiple sensors, by contrast, can reflect different characteristics of the system simultaneously, and fusing multi-source data helps improve system performance and makes the results more credible.

Multi-sensor data fusion is an important data processing technology that makes full use of multi-sensor data from different times and locations, analyzing, sorting and fusing the data to improve system performance. It plays a key role in practical production and applications.

Among multi-sensor data fusion models and methods, D-S evidence theory is one of the most effective for handling uncertain information. Evidence theory also provides the Dempster combination rule, which fuses evidence without prior information. However, when the pieces of evidence are highly conflicting, the rule yields counterintuitive results. In addition, in practice the data obtained by multiple sensors may be inconsistent in time, so the rule alone cannot fuse online data in real time. Reinforcement learning learns by trial and error and can interact with the external environment autonomously and in real time without prior information about the system. The invention therefore combines reinforcement learning with D-S evidence theory to achieve real-time online fusion of multi-sensor data.

Summary of the Invention

To achieve real-time online multi-sensor data fusion, the present invention provides an intelligent multi-sensor data fusion method based on reinforcement learning and D-S evidence theory, which addresses the problems of conflicting evidence and online real-time fusion.

The technical solution adopted by the present invention comprises the following steps:

Step 1: Multi-sensor data input

The acquired multi-sensor data is expressed as {D₁, D₂, …, D_i}, where D_i denotes the data of the i-th sensor and the system contains i sensors in total. Each sensor's data is expressed as a Basic Probability Assignment (BPA). If new sensor data arrives during measurement, it is appended to the data set, giving {D₁, D₂, …, D_i, D_new}.
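As an illustration of the input step, the sketch below represents each sensor's BPA as a mapping from subsets of the frame of discernment to masses and appends a newly arriving reading. The frame labels, the helper name is_valid_bpa and all mass values are assumptions made for this example; they are not taken from the patent's data.

```python
from typing import Dict, FrozenSet

# A BPA maps focal elements (subsets of the frame of discernment) to masses.
BPA = Dict[FrozenSet[str], float]

FRAME = frozenset({"F1", "F2", "F3"})  # hypothetical frame of discernment


def is_valid_bpa(m: BPA, frame: FrozenSet[str] = FRAME, tol: float = 1e-9) -> bool:
    """Check the BPA axioms: no mass on the empty set, focal elements
    inside the frame, masses in [0, 1], and masses summing to 1."""
    if m.get(frozenset(), 0.0) > tol:
        return False
    if any(not focal <= frame for focal in m):
        return False
    if any(mass < -tol or mass > 1.0 + tol for mass in m.values()):
        return False
    return abs(sum(m.values()) - 1.0) <= tol


# Illustrative values only (not the data of Fig. 2).
sensor_data = [
    {frozenset({"F1"}): 0.6, frozenset({"F2"}): 0.1, FRAME: 0.3},
    {frozenset({"F1"}): 0.5, frozenset({"F3"}): 0.2, FRAME: 0.3},
]

# A new sensor reading that arrives during measurement is appended.
d_new = {frozenset({"F1"}): 0.7, FRAME: 0.3}
sensor_data.append(d_new)
```

Using frozenset keys keeps compound hypotheses such as {F1, F2} hashable, which the set intersections used in evidence combination rely on.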

Step 2: Construction of the Markov decision model

Using reinforcement learning for adaptive conflict resolution of multi-sensor data requires constructing a Markov decision process (MDP) model. The MDP model comprises the system states, the actions and the reward function, as follows:

(1) Action set A: Because the sensors in a multi-source system carry different amounts of information, the system should choose different actions for different sensor information, so that highly conflicting information, when present, is resolved and the accuracy and validity of the fusion result are ensured. The action set is therefore defined as A = {a₁, a₂} = {retain, delete}, and the system retains or deletes data according to the actual situation;

(2) State set S: After the fusion system takes an action under the sensor information available at a given time, the state of the system transitions. The fusion result at the current time is defined as the state of the system, namely:

where m_t and m_{t+1} denote the fusion results of taking different actions at the current time, and a_{t+1} denotes the action currently taken. The system state set is expressed as S = {s₁, s₂, …, s_t, …}.

(3) Reward function R: the reward or penalty value that the system gives in a state s under an action a during operation.

In the multi-sensor data fusion algorithm, the reward values guide the selection among actions, so that a more accurate and effective fusion result is obtained from the same data. Reinforcement learning obtains the optimal action by maximizing the cumulative reward, so the design of the reward function is critical. In the present invention, the reward function is set according to the quality of the fusion result in a given state, and that quality is evaluated with Deng entropy, a measure of belief among pieces of evidence. The Deng entropy E(m_t) of the system in different states is defined as:

$$E(m_t) = -\sum_{A \subseteq \Theta} m_t(A) \log_2 \frac{m_t(A)}{2^{|A|} - 1}$$

where Θ denotes the frame of discernment, A is a subset of the frame of discernment with BPA m_t(A) at time t, and |A| denotes the cardinality of A.

The larger the Deng entropy, the more information the BPA contains, indicating a better fusion result in the current state. When E(m_{t+1}) ≥ E(m_t), the new state s_{t+1} is beneficial and a positive reward should be given. Otherwise, the new state s_{t+1} is detrimental and a penalty should be given. The reward function at time t+1 is therefore defined as:
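The entropy-based reward can be sketched in Python as below. The entropy follows the standard Deng entropy, E(m) = -Σ m(A) log2(m(A) / (2^|A| - 1)). The reward magnitudes r_pos = 1 and r_neg = -1 are assumptions for illustration, since the reward equation is not reproduced in this text, and the function names are hypothetical.

```python
import math
from typing import Dict, FrozenSet

BPA = Dict[FrozenSet[str], float]


def deng_entropy(m: BPA) -> float:
    """Deng entropy: -sum over focal elements A of m(A) * log2(m(A) / (2^|A| - 1))."""
    total = 0.0
    for focal, mass in m.items():
        # Zero-mass terms contribute nothing; the empty set is not a focal element.
        if mass > 0.0 and focal:
            total -= mass * math.log2(mass / (2 ** len(focal) - 1))
    return total


def reward(m_prev: BPA, m_next: BPA, r_pos: float = 1.0, r_neg: float = -1.0) -> float:
    """Positive reward if the entropy did not decrease, penalty otherwise.

    r_pos and r_neg are assumed values chosen for this sketch.
    """
    return r_pos if deng_entropy(m_next) >= deng_entropy(m_prev) else r_neg
```

For a BPA whose focal elements are all singletons, the expression reduces to Shannon entropy, which is a convenient sanity check.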

Step 3: Conflicting-data resolution with the Q-learning algorithm

The aim of conflict resolution in multi-sensor data fusion is to find an optimal policy π: S → A, i.e. a = π(s), so that the system makes the best choice when conflicting data is present and resolves the conflict effectively. The policy is selected through repeated exploration and trial and error between the agent and the environment; the optimal policy is the one that maximizes the sum of the immediate reward and the future rewards. Specifically, at time t the multi-sensor data fusion system receives a state s_t and an immediate reward R_t and, based on the reward value, determines the action to perform (retain or remove the current evidence); the intelligent fusion system then transitions to the next state s_{t+1} and produces a reward R_{t+1}. The present invention implements this process with the Q-learning algorithm.

The Q-learning algorithm obtains the optimal policy by maximizing the cumulative discounted reward. The Q-value function is the cumulative reward in a state s_t under an action a_t, expressed as:

$$Q(s_t, a_t) = \mathbb{E}\left[ \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \,\middle|\, s_t, a_t \right]$$

where γ denotes the discount factor.

The ε-greedy algorithm is used to select the optimal action (policy), expressed as:

where π*(a|s) denotes the optimal policy and ε denotes the exploration probability.
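The selection rule can be sketched as follows, assuming the common ε-greedy variant in which exploration draws uniformly from all actions (so the greedy action can also be drawn while exploring). The source does not reproduce its exact expression, so this is one plausible reading. The Q-table layout, keyed by (state, action), and the function name are hypothetical.

```python
import random
from typing import Dict, Hashable, Optional, Sequence, Tuple

ACTIONS = ("retain", "delete")  # a1 and a2 of the action set A


def epsilon_greedy(
    q: Dict[Tuple[Hashable, str], float],
    state: Hashable,
    epsilon: float,
    actions: Sequence[str] = ACTIONS,
    rng: Optional[random.Random] = None,
) -> str:
    """Explore with probability epsilon, otherwise pick the action
    with the largest Q-value in the given state."""
    rng = rng or random.Random()
    if rng.random() < epsilon:
        return rng.choice(list(actions))
    # Unseen (state, action) pairs default to a Q-value of 0.
    return max(actions, key=lambda a: q.get((state, a), 0.0))
```

Passing a seeded random.Random instance makes the exploration reproducible, which helps when comparing runs.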

The Q-value function is then updated using:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ R_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]$$

where α ∈ (0, 1] denotes the learning rate and s_{t+1} denotes the next state.
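A minimal sketch of the tabular update, assuming the standard Q-learning rule in which the old estimate is moved toward R_{t+1} + γ max_a Q(s_{t+1}, a) at rate α. The state labels, the default values α = 0.5 and γ = 0.9, and the function name q_update are assumptions for this example.

```python
from collections import defaultdict
from typing import DefaultDict, Hashable, Sequence, Tuple

QTable = DefaultDict[Tuple[Hashable, str], float]


def q_update(
    q: QTable,
    state: Hashable,
    action: str,
    reward: float,
    next_state: Hashable,
    actions: Sequence[str] = ("retain", "delete"),
    alpha: float = 0.5,
    gamma: float = 0.9,
) -> float:
    """Apply one Q-learning update in place and return the new Q(state, action)."""
    best_next = max(q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


# Unvisited entries start at 0.
q_table: QTable = defaultdict(float)
q_update(q_table, "s0", "retain", reward=1.0, next_state="s1")
```

With an all-zero table, the first update moves Q(s0, retain) half of the way toward the reward because α = 0.5.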

Through repeated interaction with and updates from the environment, the multi-sensor data fusion system obtains the optimal action at each time step and ultimately removes highly conflicting BPAs, achieving adaptive online data processing.

Step 4: Multi-sensor data fusion using evidence theory

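After conflict resolution has been applied to the sensor data, the Dempster combination rule of evidence theory is used to fuse the multi-source data effectively. For two BPAs m₁ and m₂, the rule is:

$$m(A) = \begin{cases} \dfrac{1}{1-K} \displaystyle\sum_{B \cap C = A} m_1(B)\, m_2(C), & A \neq \emptyset \\ 0, & A = \emptyset \end{cases}$$

where

$$K = \sum_{B \cap C = \emptyset} m_1(B)\, m_2(C)$$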
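Dempster's rule can be sketched in Python as below: the products of masses are accumulated on the intersections of focal elements, the mass falling on empty intersections is the conflict K, and the remainder is normalized by 1 - K. The function names and the dictionary representation of a BPA are assumptions for this example. Several BPAs are fused by folding the pairwise rule over the list.

```python
from functools import reduce
from typing import Dict, FrozenSet, Iterable

BPA = Dict[FrozenSet[str], float]


def dempster_combine(m1: BPA, m2: BPA) -> BPA:
    """Combine two BPAs with Dempster's rule; raises on total conflict."""
    combined: Dict[FrozenSet[str], float] = {}
    conflict = 0.0  # K: total mass assigned to empty intersections
    for b, mass_b in m1.items():
        for c, mass_c in m2.items():
            inter = b & c
            if inter:
                combined[inter] = combined.get(inter, 0.0) + mass_b * mass_c
            else:
                conflict += mass_b * mass_c
    if 1.0 - conflict <= 1e-12:
        raise ValueError("total conflict (K = 1): Dempster's rule is undefined")
    return {a: v / (1.0 - conflict) for a, v in combined.items()}


def fuse(bpas: Iterable[BPA]) -> BPA:
    """Fuse a non-empty sequence of BPAs by repeated pairwise combination."""
    return reduce(dempster_combine, bpas)
```

Because Dempster's rule is associative and commutative, the order in which the retained BPAs are folded does not change the fused result.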

The beneficial effects of the present invention are as follows. Combining evidence theory with reinforcement learning allows multi-sensor data to be fused efficiently and quickly. Reinforcement learning processes conflicting data and faulty data from the sensors online in real time, yielding valid multi-sensor data after conflict resolution and addressing the problem of highly conflicting data. The D-S evidence theory fusion method fuses uncertain multi-source data well and improves fusion performance.

Description of the Drawings

Fig. 1 is the overall model diagram of the present invention;

Fig. 2 shows the automobile fault data;

Fig. 3 shows the multi-sensor data fusion results.

Detailed Description

The present invention is further described below with reference to the drawings and an example. An example of fault diagnosis for an automobile system is given here, with experimental data taken from [1]. As shown in [1], the automobile system has three types of faults (denoted F₁, F₂ and F₃): low oil pressure, air leakage in the intake system, and a stuck solenoid valve. Fault data is acquired through five sensors C₁, C₂, C₃, C₄ and C₅. The implementation steps of the proposed method are described using this automobile system fault data.

Step 1: Multi-sensor data input

The multi-sensor data obtained from the automobile system is expressed as {D₁, D₂, D₃, D₄, D₅}, indicating that the system has five sensors. In this fault data set, sensor C₅ is in a failed state and the remaining sensors work normally. The data is shown in Fig. 2, where m(F₁), m(F₂) and m(F₃) denote the belief values for fault types F₁, F₂ and F₃ respectively, m(Θ) denotes the belief value that the fault type cannot be determined, and Θ = {F₁, F₂, F₃}. The aim is to determine from the multi-sensor data which type of fault has occurred in the automobile system.

Reinforcement learning is applied to the automobile system data for adaptive conflict resolution. First, a Markov decision process (MDP) model is constructed. The MDP model comprises the system states, the actions and the reward function, as follows:

(1) Action set A: Because the five sensors carry different amounts of information, the system should choose different actions for different sensor information, so that highly conflicting information, when present, is resolved and the accuracy and validity of the fusion result are ensured. The action set is therefore A = {a₁, a₂} = {retain, delete}.

(2) State set S: After the fusion system takes an action, the state of the system transitions. The present invention defines the fusion result at the current time as the state of the system, namely:

where m_t and m_{t+1} denote the fusion results of taking different actions at the current time, and a_{t+1} denotes the action currently taken. The system state set is expressed as S = {s₁, s₂, …, s_t, …}.

(3) Reward function R: the reward or penalty value that the system gives in a state s under an action a during operation. In the present invention, the reward function is set according to the quality of the fusion result in a given state, and that quality is evaluated with Deng entropy, a measure of belief among pieces of evidence. The Deng entropy E(m_t) of the system in different states is defined as:

where γ denotes the discount factor.

Step 4: Multi-sensor data fusion using D-S evidence theory

where,

Finally, the multi-sensor data fusion method of the present invention, based on reinforcement learning and D-S evidence theory, is simulated on the data in Fig. 2 and compared with the traditional Dempster combination rule and the Yager combination rule. The fusion results are shown in Fig. 3. As Fig. 3 shows, the proposed fusion method achieves relatively high fusion accuracy and fuses multi-sensor data efficiently, accurately and intelligently. The main reason is that reinforcement learning removes the conflicting data and faulty data from the multi-sensor data, preventing such data from degrading the fusion result. A further advantage of the method is that it does not need to consider whether the data arrival times are consistent, so it can fuse data online effectively.
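For context on the comparison baseline, the sketch below shows Yager's combination rule, which keeps the unnormalized conjunctive masses and moves the conflicting mass to the whole frame Θ instead of renormalizing. The frame labels and the two input BPAs are hypothetical values chosen to be highly conflicting; they are not the data of Fig. 2 and do not reproduce the results of Fig. 3.

```python
from typing import Dict, FrozenSet

BPA = Dict[FrozenSet[str], float]
FRAME = frozenset({"F1", "F2", "F3"})  # hypothetical frame of discernment


def yager_combine(m1: BPA, m2: BPA, frame: FrozenSet[str] = FRAME) -> BPA:
    """Yager's rule: conjunctive combination with the conflicting
    mass reassigned to the frame of discernment."""
    combined: Dict[FrozenSet[str], float] = {}
    conflict = 0.0  # mass falling on empty intersections
    for b, mass_b in m1.items():
        for c, mass_c in m2.items():
            inter = b & c
            if inter:
                combined[inter] = combined.get(inter, 0.0) + mass_b * mass_c
            else:
                conflict += mass_b * mass_c
    combined[frame] = combined.get(frame, 0.0) + conflict
    return combined


# Hypothetical, highly conflicting readings (illustration only).
m_a = {frozenset({"F1"}): 0.9, frozenset({"F2"}): 0.1}
m_b = {frozenset({"F3"}): 0.9, frozenset({"F2"}): 0.1}
fused = yager_combine(m_a, m_b)
```

On these inputs Yager's rule leaves almost all mass on Θ, that is, it reports near-total ignorance. Dempster's rule applied to the same inputs would instead assign all belief to F2, the hypothesis both sources consider unlikely, which is the kind of counterintuitive result the Background section refers to.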

The basic principles, main features and advantages of the present invention have been shown and described above. Those skilled in the art should understand that the present invention is not limited by the above embodiments; the embodiments and the description only illustrate the principles of the invention. Various changes and improvements may be made without departing from the spirit and scope of the invention, and all such changes and improvements fall within the scope of the claimed invention. The claimed scope of the invention is defined by the appended claims and their equivalents.

References

[1] Yan X H, Zhu J H, Kuang M C, et al. Missile aerodynamic design using reinforcement learning and transfer learning. Sci China Inf Sci, 2018, 61: 119204.

Claims

1. A multi-sensor data fusion method based on reinforcement learning and D-S evidence theory, characterized by comprising the following steps:

Step one: multi-sensor data input

The acquired multi-sensor data is expressed as {D₁, D₂, …, D_i}, where D_i denotes the data of the i-th sensor and the system contains i sensors in total; the sensor data is expressed in the form of a Basic Probability Assignment (BPA); if new sensor data is added during measurement, it is written into the data, which is then expressed as {D₁, D₂, …, D_i, D_new};

Step two: Markov decision model construction

When reinforcement learning is used to perform adaptive conflict resolution on the multi-sensor data, a Markov decision process (MDP) model is first constructed, the MDP model comprising system states, actions and a reward function, specifically:

(1) action set A: the action set is defined as A = {a₁, a₂} = {retain, delete}, and the system retains or deletes data according to the actual situation;

(2) state set S: the fusion result at the current time is defined as the state of the system, namely:

wherein m_t and m_{t+1} denote the fusion results of taking different actions at the current time, a_{t+1} denotes the action currently taken, and the system state set is expressed as S = {s₁, s₂, …, s_t, s_{t+1}, …};

(3) reward function R: the reward function is set according to the quality of the fusion result in a given state, the quality being evaluated with Deng entropy; the Deng entropy E(m_t) of the system in different states is defined as:

wherein Θ denotes the frame of discernment, A is a subset of the frame of discernment with BPA m_t(A) at time t, and |A| denotes the cardinality of A;

the larger the Deng entropy, the more information the BPA contains, indicating a better fusion result in the current state; when E(m_{t+1}) ≥ E(m_t), the new state s_{t+1} is beneficial and a positive reward should be given; otherwise, a penalty should be given; the reward function at time t+1 is thus defined as:

Step three: conflicting-data resolution with the Q-learning algorithm

The Q-learning algorithm obtains the optimal policy by maximizing the cumulative discounted reward, the Q-value function being the cumulative reward in a state s_t under an action a_t, expressed as:

wherein γ denotes the discount factor;

an ε-greedy algorithm is used to select the optimal action (policy), expressed as:

wherein π*(a|s) denotes the optimal policy and ε denotes the exploration probability;

the Q-value function is updated using the following formula:

wherein α ∈ (0, 1] denotes the learning rate and s_{t+1} denotes the next state;

through repeated interaction with and updates from the environment, the multi-sensor data fusion system obtains the optimal action at each time step and finally removes highly conflicting BPAs, achieving adaptive online data processing;

Step four: multi-sensor data fusion using D-S evidence theory

After conflict resolution has been applied to the sensor data, the Dempster combination rule of evidence theory is used to fuse the multi-source data effectively, specifically:

wherein,