CN109934753B - Multi-Agent emergency action decision method based on JADE platform and reinforcement learning - Google Patents
Multi-Agent emergency action decision method based on JADE platform and reinforcement learning
- Publication number
- CN109934753B CN109934753B CN201910182048.1A CN201910182048A CN109934753B CN 109934753 B CN109934753 B CN 109934753B CN 201910182048 A CN201910182048 A CN 201910182048A CN 109934753 B CN109934753 B CN 109934753B
- Authority
- CN
- China
- Prior art keywords
- agent
- emergency
- reinforcement learning
- emergency resource
- resource warehouse
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A10/00—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE at coastal zones; at river basins
- Y02A10/40—Controlling or monitoring, e.g. of flood or hurricane; Forecasting, e.g. risk assessment or mapping
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Description
Technical Field
The present invention belongs to the technical field of artificial intelligence, and in particular relates to a multi-Agent emergency action decision method based on the JADE platform and reinforcement learning.
Background Art
With the rapid development of China's economy and society, public emergencies of all kinds occur frequently. According to official figures, natural disasters alone affected 130 million people in 2018 and caused direct economic losses exceeding 260 billion yuan. Effective emergency action can not only prevent and reduce the occurrence of public emergencies, but also protect people's lives and property when such events do occur, bring the situation under control as quickly as possible, and minimize losses. How to use artificial intelligence technologies such as multi-Agent systems and reinforcement learning to systematically and effectively monitor, manage, and support decision-making throughout the emergency action process is therefore work that deserves further development and is of great significance.
An Agent is a computing entity or program that can perceive its environment and run autonomously to accomplish a series of goals on behalf of its designer or user. The essence of a multi-Agent system (MAS) is the idea of "divide and conquer". The characteristics of multi-Agent systems give them unique advantages in many distributed application domains. E-commerce, transportation, emergency rescue, debate systems, and telecommunication systems all exhibit distributed interaction; a multi-Agent system can significantly improve how the different entities interact, optimize execution plans, and provide better, faster, and more reliable services. Multi-Agent systems are also an extremely effective solution in the construction of some information decision support systems. JADE, a multi-Agent system simulation and implementation platform based on the FIPA specifications, is fully featured, well structured, and highly portable, and it greatly simplifies the development of multi-Agent systems. Reinforcement learning, a typical unsupervised learning method, has been widely applied in fields such as autonomous driving, intelligent control, and decision support; exploiting the autonomy of the Agents in a multi-Agent system to run reinforcement learning algorithms helps to improve the overall intelligence of the system.
Summary of the Invention
The purpose of the present invention is to provide a multi-Agent emergency action decision method based on the JADE platform and reinforcement learning, which decides how to use the various emergency resource warehouses to supply emergency resources collaboratively by jointly considering transportation cost, distance, time, and effectiveness, so that emergency resources are provided in a timely and effective manner at a relatively low economic cost.
To achieve the above object, the present invention provides a multi-Agent emergency action decision method based on the JADE platform and reinforcement learning, comprising the following steps:
Step 1: start the JADE platform and create a monitoring Agent, and use the monitoring Agent to determine in real time whether a public emergency has occurred; if one has occurred, proceed directly to step 2, otherwise repeat this step and continue monitoring.
Step 2: register the emergency resource guarantee service behaviour of each emergency resource warehouse Agent with the monitoring Agent, run reinforcement learning for each emergency resource warehouse Agent, and obtain from the monitoring Agent the reinforcement learning feedback value corresponding to each emergency resource warehouse Agent.
Step 3: based on the individual reinforcement learning feedback values, select one or more emergency resource warehouse Agents and add them to the emergency resource allocation sequence.
Further, in step 2, the monitoring Agent searches for all candidate emergency resource warehouse Agents in real time through JADE's yellow pages service.
Further, in step 2, the reinforcement learning of an emergency resource warehouse Agent comprises the following steps:
Step a: initialize the learning rate Λt, the discount factor γ, and the Q values.
Step b: each emergency resource warehouse Agent obtains the current state st through the interaction between the initiator class of the JADE interaction protocol and the responder class of the environment, selects the optimal action at in the current state st according to the state transition function P, and executes at to move to the new state st+1.
Step c: the emergency resource warehouse Agent uses the initiator class of the JADE interaction protocol to obtain the reward value rt+1 from the external environment and updates the Q value.
Step d: exit reinforcement learning once the Q values have converged.
Further, in step b, JADE's content language is used to store, in ontology form, the interaction information exchanged between the initiator class of the JADE interaction protocol and the responder class of the environment.
Further, in step b, the state transition function P selects action policies based on the softmax function, so that action policies with larger average feedback values are more likely to be adopted.
Further, in step b, the probability normalization formula of the state transition function P is:
P(st+1 | st, ai) = exp(p(st+1 | st, ai)/τ) / Σa∈A exp(p(st+1 | st, a)/τ)

where τ denotes the annealing temperature used to control the search rate: the smaller τ is, the larger the differences between the average rewards become and the more likely the optimal policy is to be chosen; p(st+1 | st, ai) denotes the probability, before normalization, that the selected action causes the state transition, and p(st+1 | st, a) for a∈A denote the probabilities, before normalization, that the individual actions in the action set cause state transitions.
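As a concrete illustration of this selection rule, the following sketch implements Boltzmann (softmax) action selection with temperature τ in plain Java; the class and method names are illustrative and do not come from the patent.

```java
import java.util.Random;

// Minimal sketch of Boltzmann (softmax) action selection with annealing temperature tau.
// q holds the current value estimates of all actions in the present state; names are illustrative.
public final class SoftmaxSelector {
    private static final Random RNG = new Random();

    public static int selectAction(double[] q, double tau) {
        double[] pref = new double[q.length];
        double sum = 0.0;
        for (int i = 0; i < q.length; i++) {
            pref[i] = Math.exp(q[i] / tau); // unnormalized preference of action i
            sum += pref[i];
        }
        // Sample an action proportionally to the normalized preferences: actions with larger
        // average feedback are chosen more often, but low-valued actions keep a non-zero chance.
        double r = RNG.nextDouble() * sum;
        double acc = 0.0;
        for (int i = 0; i < q.length; i++) {
            acc += pref[i];
            if (r <= acc) {
                return i;
            }
        }
        return q.length - 1;
    }
}
```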
Further, in step c, the Q value is updated according to:
Qt+1(st, at) = Qt(st, at) + Λt[rt+1 + γ·maxa′∈A Q(s′, a′) − Qt(st, at)]

where γ∈[0,1) is the discount factor, Λt is the learning rate, A is the action set, S is the state set, Qt(st, at) denotes the Q value determined by st and at at time t, Qt+1(st, at) denotes the updated value at time t+1, and maxa′∈A Q(s′, a′) denotes the maximum Q value in the table over the actions available in the next state s′.
Further, the action set is A = {a1, a2} and the state set is S = {C1, C2, D, F1, F2}, where C1 denotes the inventory capacity the emergency resource warehouse Agent can effectively provide, C2 denotes the types of emergency supplies it can effectively provide, D denotes the distance between the emergency resource warehouse Agent and the site of the public emergency, F1 denotes the transportation cost of emergency resources per unit distance, F2 denotes the transportation cost of emergency resources per unit mass, a1 means the emergency resource warehouse Agent chooses to join the emergency resource allocation sequence, and a2 means it chooses not to join.
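For illustration only, the state components and candidate actions listed above could be held in a simple Java value class such as the sketch below; the field and type choices are assumptions, not part of the patent.

```java
// Illustrative container for one warehouse Agent's observable state S = {C1, C2, D, F1, F2}
// and its two candidate actions A = {a1, a2}.
public class WarehouseState {
    double c1; // C1: inventory capacity the warehouse can effectively provide
    int    c2; // C2: number of emergency material types it can effectively provide
    double d;  // D: distance to the site of the public emergency
    double f1; // F1: transportation cost of emergency resources per unit distance
    double f2; // F2: transportation cost of emergency resources per unit mass

    public enum Action { JOIN, DECLINE } // a1: join the allocation sequence; a2: do not join
}
```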
The beneficial effects of the present invention are as follows. (1) Multi-Agent technology is combined with a reinforcement learning algorithm so that the supply from the emergency resource warehouses is allocated from the perspective of the emergency action system as a whole; the reinforcement learning algorithm makes full use of the autonomy of the Agents to raise the intelligence and adaptability of the multi-Agent system. (2) The method is highly extensible and applicable: it can be combined with a digital emergency plan system and use existing monitoring data and case bases for computer-aided decision-making, so that emergency actions are commanded more scientifically and effectively. (3) The JADE platform is used to build the Agents and to develop the multi-Agent system; with the communication and interaction protocols, yellow pages service, ontology support, Agent migration, and other facilities provided by JADE, the simulation of emergency rescue processes and action details can be combined with decision-support applications for actual public emergencies, forming a system application framework that optimizes drills in normal times and supports decision-making when an emergency actually occurs.
Brief Description of the Drawings
FIG. 1 is an overall flowchart of emergency action decision-making with the finite state machine model of the present invention;
FIG. 2 is a structural diagram of reinforcement learning on the JADE platform according to the present invention.
Detailed Description of the Embodiments
As shown in FIG. 1, the present invention provides a multi-Agent emergency action decision method based on the JADE platform and reinforcement learning, comprising the following steps:
Step 1: start the JADE platform and create a monitoring Agent, and use a finite state machine (FSM) model to schedule the sub-behaviours that manage the emergency response to public emergencies. The finite state machine starts from initial state 1 and executes behaviour 1: the monitoring Agent determines in real time whether a public emergency has occurred. If one has occurred, the method proceeds directly to step 2 and the machine enters intermediate state 3; if not, the machine enters intermediate state 2 (early-warning behaviour) and then transitions back to initial state 1, so that this step is repeated and monitoring continues.
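A minimal sketch of this state machine on the JADE platform might look as follows, assuming the standard FSMBehaviour API; the state labels, behaviour class names, and transition codes are illustrative placeholders rather than the patent's actual implementation.

```java
import jade.core.Agent;
import jade.core.behaviours.FSMBehaviour;
import jade.core.behaviours.OneShotBehaviour;

// Minimal sketch of the monitoring Agent's finite state machine (states 1-5, behaviours 1-5).
// The concrete sub-behaviour bodies are placeholders; only the FSM wiring is shown.
public class MonitorAgent extends Agent {
    private static final int EVENT_DETECTED = 1;
    private static final int NO_EVENT = 0;

    @Override
    protected void setup() {
        FSMBehaviour fsm = new FSMBehaviour(this);

        fsm.registerFirstState(new DetectEventBehaviour(), "S1");      // behaviour 1: detect emergencies
        fsm.registerState(new EarlyWarningBehaviour(), "S2");          // behaviour 2: early warning
        fsm.registerState(new RegisterServicesBehaviour(), "S3");      // behaviour 3: register warehouse services
        fsm.registerState(new LearningBehaviour(), "S4");              // behaviour 4: reinforcement learning
        fsm.registerLastState(new SelectWarehousesBehaviour(), "S5");  // behaviour 5: build the allocation sequence

        fsm.registerTransition("S1", "S3", EVENT_DETECTED);
        fsm.registerTransition("S1", "S2", NO_EVENT);
        // Loop back to monitoring when no emergency occurred, resetting both states on re-entry.
        fsm.registerDefaultTransition("S2", "S1", new String[] {"S1", "S2"});
        fsm.registerDefaultTransition("S3", "S4");
        fsm.registerDefaultTransition("S4", "S5");

        addBehaviour(fsm);
    }

    // Each sub-behaviour reports its outcome through onEnd(), which drives the FSM transitions.
    private class DetectEventBehaviour extends OneShotBehaviour {
        @Override public void action() { /* poll monitoring data for a public emergency */ }
        @Override public int onEnd() { return NO_EVENT; } // return EVENT_DETECTED when one is found
    }
    private class EarlyWarningBehaviour extends OneShotBehaviour {
        @Override public void action() { /* issue an early warning */ }
    }
    private class RegisterServicesBehaviour extends OneShotBehaviour {
        @Override public void action() { /* record the warehouse Agents' service registrations */ }
    }
    private class LearningBehaviour extends OneShotBehaviour {
        @Override public void action() { /* trigger Q-learning on the warehouse Agents */ }
    }
    private class SelectWarehousesBehaviour extends OneShotBehaviour {
        @Override public void action() { /* pick warehouses from the feedback values */ }
    }
}
```

The onEnd() return values play the role of the transition events between the numbered states of FIG. 1.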
Step 2: execute behaviour 3, registering the emergency resource guarantee service behaviour of each emergency resource warehouse Agent with the monitoring Agent; then enter intermediate state 4 and execute behaviour 4, the reinforcement learning of each emergency resource warehouse Agent, obtaining from the monitoring Agent the reinforcement learning feedback value corresponding to each emergency resource warehouse Agent. The reinforcement learning task corresponds to a four-tuple E = <S, A, P, R>, where S is the current state, A is the action set, P is the state transition function, and R is the feedback function. The state transition function P selects action policies based on the softmax function, which guarantees that action policies with larger average feedback values are more likely to be adopted while action policies with low average feedback values still have a chance of being selected. The emergency resource warehouse Agent obtains from the environment (the monitoring Agent) feedback values that mainly reflect effectiveness, economic benefit, and time and distance. According to the basic principle of reinforcement learning, if a behaviour policy of an emergency resource warehouse Agent receives a positive feedback value after it changes the environment, the Agent's tendency to produce that behaviour policy is strengthened; otherwise it is weakened. In the multi-Agent system, the goal of reinforcement learning is still to maximize the reward feedback, and the long-term cumulative feedback is computed with the γ-discounted cumulative feedback method, i.e., the discount applied at each successive step decreases at rate γ.
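The γ-discounted cumulative feedback referred to here is G = rt+1 + γ·rt+2 + γ²·rt+3 + …; a small helper such as the following (illustrative only, not part of the patent) computes it over a finite sequence of feedback values.

```java
// Illustrative helper: the γ-discounted cumulative feedback G = r1 + γ·r2 + γ²·r3 + ...,
// where each later feedback value is weighted by one more factor of γ.
public final class DiscountedReturn {
    public static double of(double[] rewards, double gamma) {
        double g = 0.0;
        double weight = 1.0;
        for (double r : rewards) {
            g += weight * r;
            weight *= gamma;
        }
        return g;
    }
}
```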
Step 3: execute behaviour 5, selecting one or more emergency resource warehouse Agents on the basis of the individual reinforcement learning feedback values and adding them to the emergency resource allocation sequence. At this point the finite state machine behaviour of the monitoring Agent terminates and the emergency action ends.
Further, in step 2, the monitoring Agent searches for all candidate emergency resource warehouse Agents in real time through the yellow pages service of the JADE platform.
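Such a yellow-pages lookup is typically done through JADE's DFService; the sketch below assumes the warehouse Agents have registered a service whose type string is "emergency-resource-supply" (an illustrative name, not specified by the patent).

```java
import jade.core.AID;
import jade.core.Agent;
import jade.domain.DFService;
import jade.domain.FIPAException;
import jade.domain.FIPAAgentManagement.DFAgentDescription;
import jade.domain.FIPAAgentManagement.ServiceDescription;

// Minimal sketch: look up every Agent currently offering the emergency-resource-supply service.
public final class YellowPages {
    public static AID[] findWarehouseAgents(Agent monitor) {
        DFAgentDescription template = new DFAgentDescription();
        ServiceDescription sd = new ServiceDescription();
        sd.setType("emergency-resource-supply"); // assumed service type registered by warehouse Agents
        template.addServices(sd);
        try {
            DFAgentDescription[] results = DFService.search(monitor, template);
            AID[] warehouses = new AID[results.length];
            for (int i = 0; i < results.length; i++) {
                warehouses[i] = results[i].getName();
            }
            return warehouses;
        } catch (FIPAException e) {
            e.printStackTrace();
            return new AID[0];
        }
    }
}
```

The returned AIDs can then serve as the receivers of the interaction-protocol messages used in step b.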
Further, in step 2, the reinforcement learning of an emergency resource warehouse Agent comprises the following steps:
Step a: initialize the learning rate Λt, the discount factor γ, and the Q values, with the discount factor γ∈[0,1); γ = 0 means that only the immediate reward r is considered.
Step b: each emergency resource warehouse Agent obtains the current state st through the interaction between the initiator class of the JADE interaction protocol and the responder class of the environment, selects the optimal action at in the current state st according to the state transition function P, and executes at to move to the new state st+1.
Step c: the emergency resource warehouse Agent uses the initiator class of the JADE interaction protocol to obtain the reward value rt+1 from the external environment and updates the Q value.
Step d: exit reinforcement learning once the Q values have converged, enter terminal state 5, and then execute behaviour 5.
Further, in step b, JADE's content language is used to store, in ontology form, the interaction information exchanged between the initiator class of the JADE interaction protocol and the responder class of the environment.
Further, in step b, the state transition function P selects action policies based on the softmax function, so that action policies with larger average feedback values are more likely to be adopted.
Further, in step b, the probability normalization formula of the state transition function P is:
P(st+1 | st, ai) = exp(p(st+1 | st, ai)/τ) / Σa∈A exp(p(st+1 | st, a)/τ)

where τ denotes the annealing temperature used to control the search rate: the smaller τ is, the larger the differences between the average rewards become and the more likely the optimal policy is to be chosen; p(st+1 | st, ai) denotes the probability, before normalization, that the selected action causes the state transition, and p(st+1 | st, a) for a∈A denote the probabilities, before normalization, that the individual actions in the action set cause state transitions.
Further, in step c, the Q value is updated according to:
Qt+1(st, at) = Qt(st, at) + Λt[rt+1 + γ·maxa′∈A Q(s′, a′) − Qt(st, at)]

where γ∈[0,1) is the discount factor, Λt is the learning rate, rt+1 denotes the feedback value at time t+1, A is the action set, S is the state set, Q(s, a) denotes the Q value determined by s and a, Qt(st, at) denotes the Q value determined by st and at at time t, Qt+1(st, at) denotes the updated value at time t+1, and maxa′∈A Q(s′, a′) denotes the maximum Q value in the table over the actions available in the next state s′. The learning rate Λt controls the speed of learning; its value is proportional to the convergence speed, but it must not be too large, otherwise convergence will be premature. The larger the discount factor γ∈[0,1) is, the more weight is given to long-term feedback values; the smaller it is, the more weight is given to short-term, immediate feedback. The Q-value update, derived from the Bellman equation, approaches the optimal solution through repeated update iterations, so the reinforcement learning behaviour is implemented as a CyclicBehaviour (cyclic behaviour class) that repeats the learning process continuously.
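One possible way to place this update inside a CyclicBehaviour is sketched below; the Q-table layout, the reuse of the SoftmaxSelector from the earlier sketch, and the helper methods that stand in for the JADE interaction protocol are all assumptions made for illustration.

```java
import jade.core.behaviours.CyclicBehaviour;

// Minimal sketch of the Q-value update run repeatedly inside a CyclicBehaviour until convergence.
// The Q-table layout and the helpers standing in for the JADE interaction are illustrative only.
public class QLearningBehaviour extends CyclicBehaviour {
    private final double[][] q;        // Q-table indexed by [state][action]
    private final double learningRate; // Λt
    private final double gamma;        // discount factor γ in [0, 1)
    private final double tau;          // softmax temperature τ
    private int state;                 // current state st

    public QLearningBehaviour(int numStates, int numActions,
                              double learningRate, double gamma, double tau) {
        this.q = new double[numStates][numActions]; // step a: Q values start at zero
        this.learningRate = learningRate;
        this.gamma = gamma;
        this.tau = tau;
    }

    @Override
    public void action() {
        // Step b: choose at with the softmax rule and execute it via the interaction protocol.
        int a = SoftmaxSelector.selectAction(q[state], tau);
        int nextState = executeAndObserve(a);
        // Step c: obtain rt+1 from the environment and update the Q value.
        double reward = lastReward();
        double maxNext = Double.NEGATIVE_INFINITY;
        for (double v : q[nextState]) maxNext = Math.max(maxNext, v);
        q[state][a] += learningRate * (reward + gamma * maxNext - q[state][a]);
        state = nextState;
        // Step d: stop learning once the Q values have converged.
        if (converged()) myAgent.removeBehaviour(this);
    }

    // Placeholders for the initiator-side exchange with the monitoring Agent.
    private int executeAndObserve(int action) { return state; }
    private double lastReward() { return 0.0; }
    private boolean converged() { return false; }
}
```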
Further, the action set is A = {a1, a2} and the state set is S = {C1, C2, D, F1, F2}, where C1 denotes the inventory capacity the emergency resource warehouse Agent can effectively provide, C2 denotes the types of emergency supplies it can effectively provide, D denotes the distance between the emergency resource warehouse Agent and the site of the public emergency, F1 denotes the transportation cost of emergency resources per unit distance, F2 denotes the transportation cost of emergency resources per unit mass, a1 means the emergency resource warehouse Agent chooses to join the emergency resource allocation sequence, and a2 means it chooses not to join.
FIG. 2 is a structural diagram of the model-based reinforcement learning algorithm used in the multi-Agent emergency action decision method based on the JADE platform and reinforcement learning of the present invention. The emergency response to public emergencies is managed through the monitoring Agent. The specific interaction between an emergency resource warehouse Agent and the environment (the monitoring Agent) is as follows: the emergency resource warehouse Agent creates the initiator class of the JADE interaction protocol, the monitoring Agent creates the responder class of the JADE interaction protocol, and the JADE ontology classes are used to define the feedback value rt+1 as a Concept structure, the state st as a Predicate structure, and the action at as an Action structure, and this information is exchanged through messaging.
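The three ontology elements mentioned here (the feedback value rt+1 as a Concept, the state st as a Predicate, and the action at as an AgentAction) could be declared as simple JADE content beans, as in the sketch below; the class and slot names are illustrative assumptions and are not the patent's actual ontology schema.

```java
import jade.content.AgentAction;
import jade.content.Concept;
import jade.content.Predicate;

// Illustrative JADE content beans exchanged between the warehouse Agent (initiator side)
// and the monitoring Agent (responder side) of the interaction protocol.
public class EmergencyOntologyBeans {

    // Concept carrying the scalar feedback value rt+1 returned by the environment.
    public static class Reward implements Concept {
        private double value;
        public double getValue() { return value; }
        public void setValue(double value) { this.value = value; }
    }

    // Predicate describing the warehouse state st (capacity, material types, distance, costs).
    public static class WarehouseStatus implements Predicate {
        private double capacity, distance, costPerDistance, costPerMass;
        private int materialTypes;
        public double getCapacity() { return capacity; }
        public void setCapacity(double capacity) { this.capacity = capacity; }
        // ... remaining getters/setters omitted for brevity
    }

    // AgentAction describing the chosen action at (join or decline the allocation sequence).
    public static class JoinAllocation implements AgentAction {
        private boolean join;
        public boolean getJoin() { return join; }
        public void setJoin(boolean join) { this.join = join; }
    }
}
```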
Based on multi-Agent technology, the JADE platform, and reinforcement learning, the present invention combines multi-Agent technology with a reinforcement learning algorithm from the perspective of the emergency response operation as a whole to establish a relatively intelligent decision support method. By jointly considering time, cost, effectiveness, and other factors, the emergency resource warehouses are allocated globally and used fully and effectively for emergency support. Applying the multi-Agent idea to the decision support system greatly enhances the system's adaptability, and reinforcement learning improves the coordination among the Agents and makes the system more intelligent. The method can be combined with a digital emergency plan system, using existing monitoring data and case bases for computer-aided decision-making so that emergency actions are commanded more scientifically and effectively. The system can effectively balance economic cost against time efficiency, is more intelligent and adaptive, and offers high scalability and significant practical application value.
Claims (5)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910182048.1A CN109934753B (en) | 2019-03-11 | 2019-03-11 | Multi-Agent emergency action decision method based on JADE platform and reinforcement learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910182048.1A CN109934753B (en) | 2019-03-11 | 2019-03-11 | Multi-Agent emergency action decision method based on JADE platform and reinforcement learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109934753A CN109934753A (en) | 2019-06-25 |
CN109934753B true CN109934753B (en) | 2023-05-16 |
Family
ID=66986738
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910182048.1A Active CN109934753B (en) | 2019-03-11 | 2019-03-11 | Multi-Agent emergency action decision method based on JADE platform and reinforcement learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109934753B (en) |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102207928B (en) * | 2011-06-02 | 2013-04-24 | 河海大学常州校区 | Reinforcement learning-based multi-Agent sewage treatment decision support system |
CN102622269B (en) * | 2012-03-15 | 2014-06-04 | 广西大学 | Java agent development (JADE)-based intelligent power grid power generation dispatching multi-Agent system |
CN106980548A (en) * | 2017-02-22 | 2017-07-25 | 中国科学院合肥物质科学研究院 | Intelligent repository scheduling Agent system and method based on Jade |
- 2019-03-11: Application CN201910182048.1A filed in China (CN); granted as patent CN109934753B, status active
Also Published As
Publication number | Publication date |
---|---|
CN109934753A (en) | 2019-06-25 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Dong et al. | Task scheduling based on deep reinforcement learning in a cloud manufacturing environment | |
Chen et al. | DNNOff: offloading DNN-based intelligent IoT applications in mobile edge computing | |
Li et al. | Deep reinforcement learning: Framework, applications, and embedded implementations | |
CN109919315B (en) | Forward reasoning method, device, equipment and storage medium of neural network | |
Valdez et al. | Modular neural networks architecture optimization with a new nature inspired method using a fuzzy combination of particle swarm optimization and genetic algorithms | |
CN113098714A (en) | Low-delay network slicing method based on deep reinforcement learning | |
KR20180045635A (en) | Device and method to reduce neural network | |
CN114139637B (en) | Multi-agent information fusion method and device, electronic equipment and readable storage medium | |
Heidari et al. | A QoS-aware technique for computation offloading in IoT-edge platforms using a convolutional neural network and Markov decision process | |
Ravindran et al. | Relativized options: Choosing the right transformation | |
CN109597965A (en) | Data processing method, system, terminal and medium based on deep neural network | |
CN110889497B (en) | Learning task compiling method of artificial intelligence processor and related product | |
CN116644804B (en) | Distributed training system, neural network model training method, device and medium | |
CN111160511A (en) | A swarm intelligence method for consensus active learning | |
CN112990485A (en) | Knowledge strategy selection method and device based on reinforcement learning | |
CN117331700B (en) | Computing power network resource scheduling system and method | |
Wang et al. | A multi-swarm optimizer with a reinforcement learning mechanism for large-scale optimization | |
WO2010048758A1 (en) | Classification of a document according to a weighted search tree created by genetic algorithms | |
Zhang et al. | Multi-agent system application in accordance with game theory in bi-directional coordination network model | |
CN116992151A (en) | Online course recommendation method based on double-tower graph convolution neural network | |
CN109934753B (en) | Multi-Agent emergency action decision method based on JADE platform and reinforcement learning | |
CN118839720A (en) | Network architecture searching method, device, equipment and storage medium | |
CN111046955B (en) | Multi-agent confrontation strategy intelligent prediction method and device based on graph network | |
CN118260487A (en) | Personalized learning resource recommendation system and device | |
CN115528750B (en) | Power grid safety and stability oriented data model hybrid drive unit combination method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |