CN113095498A - Divergence-based multi-agent cooperative learning method, apparatus, device and medium - Google Patents

Divergence-based multi-agent cooperative learning method, apparatus, device and medium

Info

Publication number
CN113095498A
Authority
CN
China
Prior art keywords
network
divergence
target
policy
strategy
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110315995.0A
Other languages
Chinese (zh)
Other versions
CN113095498B (en)
Inventor
卢宗青 (Lu Zongqing)
苏可凡 (Su Kefan)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Original Assignee
Peking University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University filed Critical Peking University
Priority to CN202110315995.0A priority Critical patent/CN113095498B/en
Publication of CN113095498A publication Critical patent/CN113095498A/en
Application granted granted Critical
Publication of CN113095498B publication Critical patent/CN113095498B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2455Query execution
    • G06F16/24552Database cache management
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention discloses a divergence-based multi-agent cooperative learning method, apparatus, device, and storage medium. The method comprises: initializing a value network, a policy network, and a target policy network; changing the update methods of the value network, the policy network, and the target policy network according to a preset divergence-based regularization term to obtain the latest update method; training the multiple agents according to the value network, the policy network, and the target policy network to obtain experience data, and updating the value network, the policy network, and the target policy network according to the experience data and the latest update method; and the multiple agents obtaining observation data from the environment and making decisions by combining the experience data with the updated policy network to obtain action data. The multi-agent cooperative learning method provided by the embodiments of the disclosure uses the divergence-based regularization term to enhance the exploration ability of the agents and solve the multi-agent cooperation problem.

Description

Divergence-based multi-agent cooperative learning method, apparatus, device and medium

Technical Field

The present invention relates to the technical field of machine learning, and in particular to a divergence-based multi-agent cooperative learning method, apparatus, device, and medium.

Background

Reinforcement learning agents can learn behavior policies autonomously by interacting with the environment, and have therefore been applied successfully to single-agent tasks such as robotic arm control, board and card games, and video games. However, many real-world tasks require multiple agents to cooperate, for example logistics robots, autonomous driving, and large-scale real-time strategy games. Multi-agent cooperative learning has therefore attracted increasing attention in recent years.

In cooperative multi-agent tasks, due to communication constraints, each agent usually perceives only local information within its own field of view. If every agent learns from its own local information alone, it is difficult for the agents to form effective cooperation. In the prior art, the exploration ability of an agent is improved by adding an entropy regularization term, but the entropy term also modifies the original Markov decision process, so the policy to which entropy-regularized reinforcement learning converges is not the optimal policy of the original problem; that is, the regularization biases the converged policy.

Summary of the Invention

Embodiments of the present disclosure provide a divergence-based multi-agent cooperative learning method, apparatus, device, and medium. To provide a basic understanding of some aspects of the disclosed embodiments, a brief summary is given below. This summary is not an extensive overview, nor is it intended to identify key or critical elements or to delimit the scope of protection of these embodiments. Its sole purpose is to present some concepts in a simplified form as a prelude to the detailed description that follows.

In a first aspect, an embodiment of the present disclosure provides a divergence-based multi-agent cooperative learning method, comprising:

initializing a value network, a policy network, and a target policy network;

changing the update methods of the value network, the policy network, and the target policy network according to a preset divergence-based regularization term to obtain the latest update method;

training the multiple agents according to the value network, the policy network, and the target policy network to obtain experience data, and updating the value network, the policy network, and the target policy network according to the experience data and the latest update method;

the multiple agents obtaining observation data from the environment and making decisions by combining the experience data with the updated policy network to obtain action data.

In an optional embodiment, before changing the update methods of the value network, the policy network, and the target policy network according to the preset divergence-based regularization term, the method further comprises:

constructing a maximization objective function for the multiple agents from the divergence-based regularization term.

In an optional embodiment, the divergence-based regularization term is

$$\log\frac{\pi(a_t \mid s_t)}{\rho(a_t \mid s_t)}$$

where $\pi$ denotes the policy network, $a_t$ the action, $s_t$ the state, and $\rho$ the target policy network.

In an optional embodiment, changing the update method of the value network according to the preset divergence-based regularization term comprises:

the value network is updated according to the maximization objective function (i.e., the loss $\mathcal{L}(\phi)$ below is minimized so that $Q_{\phi}$ fits the regularized target):

$$\mathcal{L}(\phi) = \mathbb{E}\Big[\big(y - Q_{\phi}(s,a)\big)^{2}\Big]$$

$$y = r + \gamma\,\mathbb{E}_{a'\sim\pi(\cdot\mid s')}\Big[Q_{\bar{\phi}}(s',a') - \lambda\log\frac{\pi(a'\mid s')}{\rho(a'\mid s')}\Big]$$

$$\bar{\phi} \leftarrow \tau\phi + (1-\tau)\bar{\phi}$$

where λ is the coefficient of the regularization term, π denotes the policy network, ρ the target policy network, φ the parameters of the value network, $\bar{\phi}$ the parameters of the target value network, s all of the information in the environment, a the action, y the target value to be fitted, r the global reward, $\mathbb{E}$ the mathematical expectation, γ the discount factor, s' the new state the environment transitions to after the agents make a decision, a' the action taken by the agents in the new state, $\mathcal{L}(\phi)$ the loss function of the value network, $Q_{\bar{\phi}}$ the target value network, $Q_{\phi}$ the value network, and τ the parameter for the moving-average update of the target value network.

In an optional embodiment, changing the update method of the policy network according to the preset divergence-based regularization term comprises:

the policy network is updated according to the minimization objective function:

$$\mathcal{L}(\theta_i) = \mathbb{E}\Big[\lambda\, D_{\mathrm{KL}}\big(\pi_{\theta_i}(\cdot\mid s)\,\big\|\,\rho_i(\cdot\mid s)\big) - \mathbb{E}_{a_i\sim\pi_{\theta_i}(\cdot\mid s)}\big[Q^{\pi,\rho}(s,a_i)\big]\Big]$$

where $\theta_i$ denotes the parameters of the policy network $\pi_i$, $\mathcal{L}(\theta_i)$ the loss function of the policy network, $\mathbb{E}$ the mathematical expectation, $a_i$ the action, s the state, $Q^{\pi,\rho}$ the value function given the policy π and the target policy ρ, λ the coefficient of the regularization term, $\rho_i$ the target policy network, and $D_{\mathrm{KL}}$ the KL divergence.

In an optional embodiment, changing the update method of the target policy network according to the preset divergence-based regularization term comprises:

the target policy network is updated according to the moving average of the policy network:

$$\bar{\theta}_i \leftarrow \tau\theta_i + (1-\tau)\bar{\theta}_i$$

where $\bar{\theta}_i$ denotes the parameters of the target policy network of agent i, τ the parameter of the target policy network for the moving-average update, and $\theta_i$ the parameters of the policy network.

In an optional embodiment, training the multiple agents according to the value network, the policy network, and the target policy network to obtain experience data, and updating the value network, the policy network, and the target policy network according to the experience data and the latest update method, comprises:

each agent obtaining observation data from the environment and, using the experience data, making decisions according to the policy network to obtain action data;

the environment giving a reward according to the current state and the joint action, and transitioning to the next state;

storing a tuple as one piece of experience data in a cache database, the tuple comprising the current environment state, the current action, the next environment state, the observation data, and the global reward;

after an episode ends, obtaining several pieces of experience data from the cache database for training, and updating the value network, the policy network, and the target policy network.

In a second aspect, an embodiment of the present disclosure provides a divergence-based multi-agent cooperative learning apparatus, comprising:

an initialization module, configured to initialize a value network, a policy network, and a target policy network;

a changing module, configured to change the update methods of the value network, the policy network, and the target policy network according to a preset divergence-based regularization term to obtain the latest update method;

a training module, configured for the multiple agents to train according to the value network, the policy network, and the target policy network to obtain experience data, and to update the value network, the policy network, and the target policy network according to the experience data and the latest update method;

an execution module, configured for the multiple agents to obtain observation data from the environment and to make decisions by combining the experience data with the updated policy network to obtain action data.

In a third aspect, an embodiment of the present disclosure provides a divergence-based multi-agent cooperative learning device, comprising a processor and a memory storing program instructions, wherein the processor is configured to execute, when running the program instructions, the divergence-based multi-agent cooperative learning method provided by the above embodiments.

In a fourth aspect, an embodiment of the present disclosure provides a computer-readable medium on which computer-readable instructions are stored, the computer-readable instructions being executable by a processor to implement the divergence-based multi-agent cooperative learning method provided by the above embodiments.

The technical solutions provided by the embodiments of the present disclosure may include the following beneficial effects:

Embodiments of the present disclosure provide a multi-agent cooperative learning method for solving the multi-agent cooperation problem. The method adds a target policy for each agent, uses the target policy to add a divergence-based regularization term to the reward function of an ordinary Markov decision process, and proposes a new policy update scheme for the divergence-based Markov decision process. The divergence-based regularization term enhances the exploration ability of the agents, and the term vanishes when the agents' policies converge, which avoids the bias a regularization term would otherwise introduce into the converged policy. Moreover, with this regularization term, off-policy training becomes possible, which addresses the sample-efficiency problem in multi-agent environments. The term also controls the step size of policy updates, which improves the stability of policy improvement. The method proposed in the embodiments of the present disclosure is flexible and has important application value in fields such as intelligent transportation and robot control.

It should be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and do not limit the present invention.

Description of the Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present invention and, together with the description, serve to explain the principles of the invention.

FIG. 1 is a schematic flowchart of a divergence-based multi-agent cooperative learning method according to an exemplary embodiment;

FIG. 2 is a schematic flowchart of a divergence-based multi-agent training method according to an exemplary embodiment;

FIG. 3 is a schematic structural diagram of a divergence-based multi-agent cooperative learning apparatus according to an exemplary embodiment;

FIG. 4 is a schematic structural diagram of a divergence-based multi-agent cooperative learning device according to an exemplary embodiment;

FIG. 5 is a schematic diagram of a computer storage medium according to an exemplary embodiment.

Detailed Description

The following description and the accompanying drawings illustrate specific embodiments of the present invention sufficiently to enable those skilled in the art to practice them.

It should be understood that the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present invention.

When the following description refers to the drawings, unless otherwise indicated, the same numerals in different drawings denote the same or similar elements. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present invention; rather, they are merely examples of systems and methods consistent with some aspects of the invention as recited in the appended claims.

In the description of the present invention, it should be understood that the terms "first", "second", and the like are used for descriptive purposes only and should not be construed as indicating or implying relative importance. Those of ordinary skill in the art can understand the specific meanings of the above terms in the present invention according to the specific situation. In addition, in the description of the present invention, unless otherwise specified, "a plurality of" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: A alone, both A and B, or B alone. The character "/" generally indicates an "or" relationship between the associated objects.

In the multi-agent cooperation problem, the state of the environment, i.e., all of the information in the environment, is s. Each agent i obtains partial information $o_i$ from the environment, and obtains and executes its own action $a_i$ according to its history $\tau_i = \{(o_i, a_i)\}$ and its own policy $\pi_i$. Given the current state s and the joint action $a = (a_1, a_2, \ldots, a_n)$ of all agents, every agent receives the same global reward $r(s, a)$, and the environment transitions to the next state $s' \sim P(\cdot \mid s, a)$. Writing the joint policy of all agents as $\pi = \pi_1 \times \pi_2 \times \cdots \times \pi_n$, the common objective that each agent maximizes is

$$\mathbb{E}_{\pi}\Big[\sum_{t}\gamma^{t}\, r(s_t, a_t)\Big]$$

where $\mathbb{E}$ denotes the mathematical expectation, π the policy, $a_t$ the action, $s_t$ the state, and γ the discount factor.

Entropy-regularized reinforcement learning mainly subtracts $\log\pi(a_t \mid s_t)$ from the reward function, i.e., the optimization objective of the agent is modified to

$$\mathbb{E}_{\pi}\Big[\sum_{t}\gamma^{t}\big(r(s_t, a_t) - \lambda\log\pi(a_t \mid s_t)\big)\Big]$$

where $\mathbb{E}$ denotes the mathematical expectation, π the policy, $a_t$ the action, $s_t$ the state, γ the discount factor, and λ the coefficient of the regularization term. Adding the entropy term improves the exploration ability of the agent, but it also modifies the original Markov decision process, so the policy to which entropy-regularized reinforcement learning converges is not the optimal policy of the original problem; the regularization biases the converged policy.
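To make this bias concrete, consider a single-state example (an illustration added for exposition; the numbers are arbitrary and not part of the original text): with an entropy term, the policy that maximizes the regularized objective is a softmax of the action values rather than the greedy policy of the original problem, so it permanently keeps probability on suboptimal actions. A minimal Python sketch:

```python
import numpy as np

# Two-action bandit with true action values Q = [1.0, 0.0].
Q = np.array([1.0, 0.0])
lam = 0.5  # entropy-regularization coefficient

# The entropy-regularized objective E_pi[Q] + lam * H(pi) is maximized by a
# softmax of Q / lam, not by the greedy policy of the original problem.
pi_entropy = np.exp(Q / lam) / np.exp(Q / lam).sum()
pi_greedy = np.array([1.0, 0.0])

print(pi_entropy)  # ~[0.88, 0.12]: converged policy keeps mass on the worse action
print(pi_greedy)   # optimal policy of the original (unregularized) problem
```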

Embodiments of the present disclosure propose a new divergence-based regularization term and a new policy update scheme for the divergence-based Markov decision process. The divergence-based regularization term enhances the exploration ability of the agents, and it vanishes when the agents' policies converge, which avoids the bias a regularization term would otherwise introduce into the converged policy. Moreover, with this regularization term, off-policy training becomes possible, which addresses the sample-efficiency problem in multi-agent environments. The term also controls the step size of policy updates, which improves the stability of policy improvement.

The divergence-based multi-agent cooperative learning method provided by the embodiments of the present application is described in detail below with reference to FIG. 1 and FIG. 2.

The method applies to scenarios with multiple cooperating agents, each of which interacts with the environment during training. In real life, many tasks require multiple agents to cooperate, for example logistics robots, autonomous driving, and large-scale real-time strategy games. A system in which multiple agents must cooperate to complete a task is called a multi-agent system. For example, a warehouse logistics system is a multi-agent system in which each logistics robot is an agent.

Referring to FIG. 1, the method specifically includes the following steps.

S101初始化值网络Q、策略网络πi以及目标策略网络ρiS101 initializes the value network Q, the strategy network π i and the target strategy network ρ i .

S102: change the update methods of the value network, the policy network, and the target policy network according to the preset divergence-based regularization term to obtain the latest update method.

In a possible implementation, before step S102 is performed, the method further includes constructing the maximization objective function of the multiple agents from the divergence-based regularization term.

Specifically, a target policy $\rho_i$ is added for each agent, and using this added target policy, the term

$$\log\frac{\pi(a_t \mid s_t)}{\rho(a_t \mid s_t)}$$

is subtracted from the reward function as a regularization term, referred to as the divergence-based regularization term, where π denotes the policy network, $a_t$ the action, $s_t$ the state, and ρ the target policy network.

Further, using the divergence-based regularization term, the maximization objective function of the multiple agents is

$$\mathbb{E}_{\pi}\Big[\sum_{t}\gamma^{t}\Big(r(s_t, a_t) - \lambda\log\frac{\pi(a_t \mid s_t)}{\rho(a_t \mid s_t)}\Big)\Big]$$

where $\mathbb{E}$ denotes the mathematical expectation, π the policy, $a_t$ the action, $s_t$ the state, γ the discount factor, λ the coefficient of the regularization term, and ρ the target policy network.
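For illustration, the regularized return of a trajectory under this objective can be evaluated as in the following sketch (a minimal example assuming discrete actions and explicit per-step action probabilities; the function name and toy numbers are assumptions, not part of the original disclosure):

```python
import numpy as np

def divergence_regularized_return(rewards, pi_probs, rho_probs, gamma, lam):
    """Discounted return with r_t - lam * log(pi(a_t|s_t) / rho(a_t|s_t)) at each step."""
    total = 0.0
    for t, (r, p_pi, p_rho) in enumerate(zip(rewards, pi_probs, rho_probs)):
        total += gamma ** t * (r - lam * np.log(p_pi / p_rho))
    return total

# When pi equals rho the regularizer vanishes and the objective reduces to the
# ordinary discounted return of the original Markov decision process.
print(divergence_regularized_return([1.0, 1.0], [0.5, 0.5], [0.5, 0.5], 0.99, 0.1))  # 1.99
print(divergence_regularized_return([1.0, 1.0], [0.9, 0.9], [0.5, 0.5], 0.99, 0.1))  # < 1.99
```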

The embodiments of the present disclosure replace the entropy regularization term with the divergence-based regularization term and use it to obtain a new optimization objective. If the target policy is taken to be a past policy, then when the probability of an action rises, the regularization term reduces its reward, and when the probability falls, it increases the reward; that is, the term encourages the agents to take actions whose probability has decreased and discourages actions whose probability has increased, so that exploration is more thorough during training. The term also effectively controls the step size of policy updates, since the new policy cannot deviate too far from the past policy, which benefits the stability of policy improvement. Furthermore, after the divergence-based regularization term is added, the latest update method derived from it enables off-policy training, which solves the sample-efficiency problem that policy-based reinforcement learning methods face in the multi-agent setting.

Specifically, changing the update method of the value network according to the preset divergence-based regularization term includes the following.

In ordinary reinforcement learning, the value function is usually defined as

$$Q^{\pi}(s_t, a_t) = \mathbb{E}_{\pi}\Big[\sum_{l=0}^{\infty}\gamma^{l}\, r(s_{t+l}, a_{t+l})\Big]$$

where Q denotes the value function, π the policy, $a_t$ the action, $s_t$ the state, γ the discount factor, $\mathbb{E}$ the mathematical expectation, and r the global reward.

During training, the value function is usually updated according to the Bellman optimality equation:

$$Q^{*}(s, a) = r(s, a) + \gamma\,\mathbb{E}_{s'}\Big[\max_{a'} Q^{*}(s', a')\Big]$$

where $Q^{*}$ denotes the updated value function, π the policy, a the action, s the state, γ the discount factor, $\mathbb{E}$ the mathematical expectation, r the global reward, s' the state at the next step, and a' the action at the next step.

After the divergence-based regularization term is added, a new iterative formula for the optimal value function is obtained:

$$Q^{*}(s, a) = r(s, a) + \gamma\,\mathbb{E}_{s'}\Big[Q^{*}(s', a') - \lambda\log\frac{\pi^{*}(a' \mid s')}{\rho(a' \mid s')}\Big]$$

where $Q^{*}$ denotes the optimal value function, a the action, s the state, γ the discount factor, $\mathbb{E}$ the mathematical expectation, r the global reward, s' the state at the next step, a' the action at the next step, λ the coefficient of the regularization term, ρ the target policy network, and $\pi^{*}$ the optimal policy. Apart from adding the divergence-based regularization term, this formula, unlike the Bellman optimality equation, does not require a' to be the optimal action; a' can be chosen arbitrarily. This adds flexibility during training and allows a TD(λ)-style objective to be used in off-policy training.

Based on the above iterative formula, the value network is designed to update according to the maximization objective function (i.e., the loss $\mathcal{L}(\phi)$ below is minimized so that $Q_{\phi}$ fits the regularized target):

$$\mathcal{L}(\phi) = \mathbb{E}\Big[\big(y - Q_{\phi}(s, a)\big)^{2}\Big]$$

$$y = r + \gamma\,\mathbb{E}_{a'\sim\pi(\cdot\mid s')}\Big[Q_{\bar{\phi}}(s', a') - \lambda\log\frac{\pi(a' \mid s')}{\rho(a' \mid s')}\Big]$$

$$\bar{\phi} \leftarrow \tau\phi + (1-\tau)\bar{\phi}$$

where λ is the coefficient of the regularization term, π denotes the policy network, ρ the target policy network, φ the parameters of the value network, $\bar{\phi}$ the parameters of the target value network, s all of the information in the environment, a the action, y the target value to be fitted, r the global reward, $\mathbb{E}$ the mathematical expectation, γ the discount factor, s' the new state the environment transitions to after the agents make a decision, a' the action taken by the agents in the new state, $\mathcal{L}(\phi)$ the loss function of the value network, $Q_{\bar{\phi}}$ the target value network, $Q_{\phi}$ the value network, and τ the parameter for the moving-average update of the target value network.
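A minimal NumPy sketch of computing this target and loss for a batch with discrete joint actions is given below; the array shapes, function names, and toy values are illustrative assumptions rather than details taken from the disclosure:

```python
import numpy as np

def value_targets(r, q_target_next, pi_next, rho_next, gamma, lam):
    """TD target  y = r + gamma * E_{a'~pi}[ Q_target(s', a') - lam * log(pi/rho) ].

    q_target_next, pi_next, rho_next: arrays of shape (batch, n_actions) giving
    Q_phi_bar(s', .) and the probabilities of pi and rho at the next state s'.
    """
    penalty = lam * np.log(pi_next / rho_next)             # divergence term per action
    expectation = (pi_next * (q_target_next - penalty)).sum(axis=1)
    return r + gamma * expectation

def value_loss(q_sa, y):
    """Mean-squared error between Q_phi(s, a) and the fixed target y."""
    return np.mean((y - q_sa) ** 2)

# Toy batch of size 1 with two joint actions.
y = value_targets(np.array([1.0]), np.array([[0.5, 0.2]]),
                  np.array([[0.6, 0.4]]), np.array([[0.5, 0.5]]),
                  gamma=0.99, lam=0.1)
print(y, value_loss(np.array([0.8]), y))
```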

Further, changing the update method of the policy network according to the preset divergence-based regularization term includes the following.

Policy-based reinforcement learning methods usually obtain the update of the policy network from the policy gradient theorem, but the policy gradient theorem requires the update to be on-policy. After the divergence-based regularization term is added, a new policy improvement theorem is obtained: given an old policy $\pi_{\mathrm{old}}$, the new policy $\pi_{\mathrm{new}}$ only needs to satisfy, for every state s,

$$\mathbb{E}_{a\sim\pi_{\mathrm{new}}(\cdot\mid s)}\big[Q^{\pi_{\mathrm{old}},\rho}(s, a)\big] - \lambda D_{\mathrm{KL}}\big(\pi_{\mathrm{new}}(\cdot\mid s)\,\big\|\,\rho(\cdot\mid s)\big) \;\geq\; \mathbb{E}_{a\sim\pi_{\mathrm{old}}(\cdot\mid s)}\big[Q^{\pi_{\mathrm{old}},\rho}(s, a)\big] - \lambda D_{\mathrm{KL}}\big(\pi_{\mathrm{old}}(\cdot\mid s)\,\big\|\,\rho(\cdot\mid s)\big)$$

where $Q^{\pi_{\mathrm{old}},\rho}$ denotes the value function under the old policy and the target policy ρ, and $D_{\mathrm{KL}}$ denotes the KL divergence; then $\pi_{\mathrm{new}}$ is better than $\pi_{\mathrm{old}}$. According to this theorem, the policy gradient can be derived and computed by the following formula:

$$\nabla_{\theta_i}\mathcal{L}(\theta_i) = \mathbb{E}_{s\sim\mathcal{D},\,a\sim\pi}\Big[\nabla_{\theta_i}\log\pi_{\theta_i}(a_i\mid s)\Big(\lambda\log\frac{\pi_{\theta_i}(a_i\mid s)}{\rho_i(a_i\mid s)} - Q^{\pi,\rho}(s, a)\Big)\Big]$$

where $\theta_i$ denotes the parameters of the policy network $\pi_i$, $\mathcal{D}$ the cache database used for storing the training data, $\mathcal{L}(\theta_i)$ the loss function of the policy network, $\mathbb{E}$ the mathematical expectation, $a_i$ the action, s the state, $Q^{\pi,\rho}$ the value function given the policy π and the target policy ρ, λ the coefficient of the regularization term, and ρ the target policy network.

Apart from the regularization term, the biggest difference between the above gradient formula and the policy gradient theorem is that it no longer constrains the distribution of the state s, so the above update can be performed off-policy. In the cooperative multi-agent setting, a counterfactual baseline function can be subtracted in the gradient to address the credit-assignment problem and to reduce the variance of the gradient update; in addition, the counterfactual baseline reduces computation. Taking the baseline function as

$$b(s, a_{-i}) = \mathbb{E}_{a_i\sim\pi_{\theta_i}(\cdot\mid s)}\big[Q^{\pi,\rho}(s, (a_i, a_{-i}))\big]$$

where $a_{-i}$ denotes the actions of the other agents, the gradient formula can be further rewritten, and the policy network is updated according to the minimization objective shown in the following formula:

$$\mathcal{L}(\theta_i) = \mathbb{E}\Big[\lambda\, D_{\mathrm{KL}}\big(\pi_{\theta_i}(\cdot\mid s)\,\big\|\,\rho_i(\cdot\mid s)\big) - \mathbb{E}_{a_i\sim\pi_{\theta_i}(\cdot\mid s)}\big[Q^{\pi,\rho}(s, a_i)\big]\Big]$$

where $\theta_i$ denotes the parameters of the policy network $\pi_i$, $\mathcal{L}(\theta_i)$ the loss function of the policy network, $\mathbb{E}$ the mathematical expectation, $a_i$ the action, s the state, $Q^{\pi,\rho}$ the value function given the policy π and the target policy ρ, λ the coefficient of the regularization term, $\rho_i$ the target policy network, and $D_{\mathrm{KL}}$ the KL divergence.
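The following sketch illustrates a single-sample estimate consistent with the gradient formula above for one agent with discrete actions; the exact interfaces, the placeholder gradient vector, and the toy numbers are assumptions made for illustration only:

```python
import numpy as np

def policy_gradient_sample(grad_log_pi, a_i, pi_i, rho_i, q_values, lam):
    """One-sample estimate of the minimization gradient for agent i:
    grad log pi_i(a_i|s) * ( lam * log(pi_i(a_i|s)/rho_i(a_i|s)) - (Q(s,a_i) - b(s)) ),
    with the counterfactual-style baseline b(s) = E_{a_i~pi_i}[Q(s, a_i)]."""
    baseline = np.sum(pi_i * q_values)
    advantage = q_values[a_i] - baseline
    penalty = lam * np.log(pi_i[a_i] / rho_i[a_i])
    return grad_log_pi * (penalty - advantage)

pi_i = np.array([0.7, 0.3])          # pi_i(.|s) for agent i at state s
rho_i = np.array([0.5, 0.5])         # target policy rho_i(.|s)
q_values = np.array([1.0, 0.2])      # Q^{pi,rho}(s, a_i) with the other agents' actions fixed
grad_log_pi = np.array([0.3, -0.1])  # placeholder for d/d(theta_i) log pi_i(a_i|s)
print(policy_gradient_sample(grad_log_pi, 0, pi_i, rho_i, q_values, lam=0.1))
```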

Further, changing the update method of the target policy network according to the preset divergence-based regularization term includes the following.

For the target policy, in the most ideal case the following iteration should be used:

$$\pi_{t+1} = \arg\max_{\pi}\ \mathbb{E}_{\pi}\Big[\sum_{k}\gamma^{k}\Big(r(s_k, a_k) - \lambda\log\frac{\pi(a_k\mid s_k)}{\pi_t(a_k\mid s_k)}\Big)\Big]$$

i.e., the optimal policy $\pi_{t+1}$ is obtained with the target policy fixed to $\rho = \pi_t$. When this iterative process converges, $\pi_{t+1} = \pi_t$, and hence the regularization term

$$\lambda\log\frac{\pi_{t+1}(a\mid s)}{\pi_t(a\mid s)} = 0$$

so the converged policy is the optimal policy of the original Markov decision process, which avoids the bias the regularization term would otherwise introduce into the converged policy. However, in this ideal case every iteration would require training by reinforcement learning until convergence, which is too costly, so the following approximation can be used:

$$\rho_i = \pi_{\bar{\theta}_i}$$

$$\bar{\theta}_i \leftarrow \tau\theta_i + (1-\tau)\bar{\theta}_i$$

where $\bar{\theta}_i$ denotes the parameters of the target policy network of agent i, τ the parameter of the target policy network for the moving-average update, and $\theta_i$ the parameters of the policy network. That is, the maximization is replaced by a one-step gradient update of the policy network, and the target policy is taken as the moving average of the policy.

The target policy network is updated according to the moving average of the policy network:

$$\bar{\theta}_i \leftarrow \tau\theta_i + (1-\tau)\bar{\theta}_i$$

where $\bar{\theta}_i$ denotes the parameters of the target policy network of agent i, τ the parameter of the target policy network for the moving-average update, and $\theta_i$ the parameters of the policy network.
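A minimal sketch of this moving-average (Polyak-style) update on raw parameter arrays is shown below; the same illustrative helper could serve both the target value network and the target policy network (the names and numbers are assumptions for illustration):

```python
import numpy as np

def soft_update(target_params, online_params, tau):
    """Moving-average update: target <- tau * online + (1 - tau) * target."""
    return [tau * p + (1.0 - tau) * t for t, p in zip(target_params, online_params)]

theta_i = [np.array([1.0, 2.0])]      # policy network parameters of agent i
theta_i_bar = [np.array([0.0, 0.0])]  # target policy network parameters of agent i
theta_i_bar = soft_update(theta_i_bar, theta_i, tau=0.01)
print(theta_i_bar)  # [array([0.01, 0.02])]
```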

With this step, the latest update methods of the value network, the policy network, and the target policy network are obtained from the divergence-based regularization term. With these update methods, the regularization term vanishes when the agents' policies converge, which avoids the bias the term would otherwise introduce into the converged policy; off-policy training becomes possible, which solves the sample-efficiency problem in multi-agent environments; and the step size of policy updates is controlled, which improves the stability of policy improvement.

S103: the multiple agents are trained according to the value network, the policy network, and the target policy network to obtain experience data, and the value network, the policy network, and the target policy network are updated according to the experience data and the latest update method.

FIG. 2 is a schematic flowchart of a divergence-based multi-agent training method according to an exemplary embodiment. As shown in FIG. 2, the multi-agent training process includes: S201, the agents obtain observation data from the environment and, using the experience data, make decisions according to the policy network to obtain action data; S202, the environment gives a reward according to the current state and the joint action and transitions to the next state; S203, a tuple is stored as one piece of experience data in the cache database, the tuple comprising the current environment state, the current action, the next environment state, the observation data, and the global reward; S204, after an episode ends, several pieces of experience data are obtained from the cache database for training, and the value network, the policy network, and the target policy network are updated.

Specifically, training data is obtained through interaction between the agents and the environment: agent i obtains observation data $o_i$ from the environment and, using its history $\tau_i$, makes a decision according to its policy $\pi_i$ to obtain action $a_i$; the environment gives a reward r according to the current state s and the joint action a and transitions to the next state s'; the tuple (s, a, s', o, r) is stored as one piece of experience data in the cache database $\mathcal{D}$. After an episode ends, several pieces of experience are taken from $\mathcal{D}$ for training, and the value network Q, the policy networks $\pi_i$, and the target policy networks $\rho_i$ are updated according to the latest update method.
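The interaction-and-storage loop described above can be sketched as follows; the environment and agent interfaces (reset, observe, step, act) and the update callback are illustrative stand-ins for whatever simulator and networks are actually used:

```python
import random

def run_episode_and_train(env, agents, buffer, batch_size, update_networks):
    """One episode of interaction followed by training, mirroring steps S201-S204."""
    state, done = env.reset(), False
    while not done:
        # S201: each agent observes locally and decides with its policy network.
        obs = env.observe()                               # one observation per agent
        actions = [agent.act(o) for agent, o in zip(agents, obs)]
        # S202: the environment rewards the joint action and transitions.
        next_state, reward, done = env.step(actions)
        # S203: store the tuple (s, a, s', o, r) as one piece of experience.
        buffer.append((state, actions, next_state, obs, reward))
        state = next_state
    # S204: after the episode ends, sample several pieces of experience and
    # update the value, policy, and target policy networks.
    batch = random.sample(buffer, min(batch_size, len(buffer)))
    update_networks(batch)
```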

In an exemplary scenario, in the game StarCraft II, a player controls several units to fight the enemy, with the goal of destroying all enemy units. In training, each unit is regarded as an agent. Each agent has a field of view, and its observation data consists of the relevant attributes of the units within that field of view (for example hit points, shield value, unit type, and so on). The actions an agent can take include moving, attacking, casting abilities, and so on. The agent's policy is generated by the policy network and is a probability distribution over the actions the agent can take in the current state. An agent is rewarded when it damages or destroys an enemy unit. At each step, the state (the data corresponding to all information of the current game, available only during training), the observations of all agents, the actions of all agents, the reward obtained after the actions, and the new state transitioned to are stored as one piece of experience. After several steps, or after a game ends, the previously collected experience data is used for training and the agents' policies are updated; new experience is then collected with the new policies, and this process is repeated until the agents achieve good performance.

S104: the multiple agents obtain observation data from the environment and make decisions by combining the experience data with the updated policy network to obtain action data.

Further, during execution, agent i obtains observation data $o_i$ from the environment at each step and, combining its history $\tau_i$, makes a decision according to the updated policy network $\pi_i$ to obtain action $a_i$.
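At execution time, decision making reduces to sampling an action from the distribution output by the updated policy network for the current observation; a minimal sketch with a discrete action space (the function name and numbers are illustrative only):

```python
import numpy as np

def select_action(policy_probs):
    """Sample an action a_i from the distribution pi_i(.|tau_i, o_i) produced by
    the updated policy network for the agent's current observation."""
    return int(np.random.choice(len(policy_probs), p=policy_probs))

print(select_action(np.array([0.1, 0.7, 0.2])))  # returns 1 most of the time
```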

The multi-agent cooperative learning method provided by the embodiments of the present disclosure solves the multi-agent cooperation problem. The divergence-based regularization term enhances the exploration ability of the agents and vanishes when the agents' policies converge, which avoids the bias the term would otherwise introduce into the converged policy. With this regularization term, off-policy training becomes possible, which solves the sample-efficiency problem in multi-agent environments, and the step size of policy updates is controlled, which improves the stability of policy improvement. The method is flexible: in robot control it can address the cooperative control of multiple robots, and in games it can address the cooperative control of multiple game characters, serving as a multi-agent artificial intelligence system within a game.

An embodiment of the present disclosure further provides a divergence-based multi-agent cooperative learning apparatus for executing the divergence-based multi-agent cooperative learning method of the above embodiments. As shown in FIG. 3, the apparatus includes:

an initialization module 301, configured to initialize a value network, a policy network, and a target policy network;

a changing module 302, configured to change the update methods of the value network, the policy network, and the target policy network according to a preset divergence-based regularization term to obtain the latest update method;

a training module 303, configured for the multiple agents to train according to the value network, the policy network, and the target policy network to obtain experience data, and to update the value network, the policy network, and the target policy network according to the experience data and the latest update method;

an execution module 304, configured for the multiple agents to obtain observation data from the environment and to make decisions by combining the experience data with the updated policy network to obtain action data.

It should be noted that when the divergence-based multi-agent cooperative learning apparatus provided by the above embodiment executes the divergence-based multi-agent cooperative learning method, the division into the above functional modules is only used as an example. In practical applications, the above functions may be assigned to different functional modules as needed; that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the divergence-based multi-agent cooperative learning apparatus provided by the above embodiment and the embodiments of the divergence-based multi-agent cooperative learning method belong to the same concept; for details of the implementation process, refer to the method embodiments, which are not repeated here.

An embodiment of the present disclosure further provides an electronic device corresponding to the divergence-based multi-agent cooperative learning method provided by the foregoing embodiments, for executing the above divergence-based multi-agent cooperative learning method.

Please refer to FIG. 4, which shows a schematic diagram of an electronic device provided by some embodiments of the present application. As shown in FIG. 4, the electronic device includes a processor 400, a memory 401, a bus 402, and a communication interface 403; the processor 400, the communication interface 403, and the memory 401 are connected through the bus 402; the memory 401 stores a computer program executable on the processor 400, and when the processor 400 runs the computer program, it executes the divergence-based multi-agent cooperative learning method provided by any of the foregoing embodiments of the present application.

The memory 401 may include a high-speed random access memory (RAM) and may also include a non-volatile memory, for example at least one disk memory. The communication connection between this system network element and at least one other network element is realized through at least one communication interface 403 (which may be wired or wireless), and the Internet, a wide area network, a local network, a metropolitan area network, or the like may be used.

The bus 402 may be an ISA bus, a PCI bus, an EISA bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. The memory 401 is configured to store a program, and the processor 400 executes the program after receiving an execution instruction. The divergence-based multi-agent cooperative learning method disclosed in any of the foregoing embodiments of the present application may be applied to the processor 400 or implemented by the processor 400.

The processor 400 may be an integrated circuit chip with signal processing capability. In implementation, the steps of the above method may be completed by integrated logic circuits of hardware in the processor 400 or by instructions in the form of software. The processor 400 may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and so on; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or execute the methods, steps, and logic block diagrams disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The steps of the method disclosed in connection with the embodiments of the present application may be directly embodied as being executed by a hardware decoding processor, or executed by a combination of hardware and software modules in a decoding processor. The software modules may be located in a storage medium mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory 401, and the processor 400 reads the information in the memory 401 and completes the steps of the above method in combination with its hardware.

The electronic device provided by the embodiments of the present application and the divergence-based multi-agent cooperative learning method provided by the embodiments of the present application are based on the same inventive concept and have the same beneficial effects as the method it adopts, runs, or implements.

An embodiment of the present application further provides a computer-readable storage medium corresponding to the divergence-based multi-agent cooperative learning method provided by the foregoing embodiments. Please refer to FIG. 5; the computer-readable storage medium shown is an optical disc 500 on which a computer program (i.e., a program product) is stored, and when the computer program is run by a processor, it executes the divergence-based multi-agent cooperative learning method provided by any of the foregoing embodiments.

It should be noted that examples of the computer-readable storage medium may also include, but are not limited to, a phase-change memory (PRAM), a static random access memory (SRAM), a dynamic random access memory (DRAM), other types of random access memory (RAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a flash memory, or other optical or magnetic storage media, which will not be enumerated here.

The computer-readable storage medium provided by the above embodiments of the present application and the divergence-based multi-agent cooperative learning method provided by the embodiments of the present application are based on the same inventive concept and have the same beneficial effects as the method adopted, run, or implemented by the application program it stores.

The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as the combinations of these technical features are not contradictory, they should be considered within the scope of this specification.

The above embodiments express only several implementations of the present invention, and their descriptions are specific and detailed, but they should not be construed as limiting the scope of the patent of the present invention. It should be pointed out that those of ordinary skill in the art may make several modifications and improvements without departing from the concept of the present invention, and these all fall within the protection scope of the present invention. Therefore, the protection scope of the patent of the present invention shall be subject to the appended claims.

Claims (10)

1. A divergence-based multi-agent cooperative learning method, characterized by comprising the following steps:
initializing a value network, a policy network, and a target policy network;
changing the update methods of the value network, the policy network, and the target policy network according to a preset divergence-based regularization term to obtain the latest update method;
training the plurality of agents according to the value network, the policy network, and the target policy network to obtain experience data, and updating the value network, the policy network, and the target policy network according to the experience data and the latest update method;
and the plurality of agents obtaining observation data from the environment and making decisions by combining the experience data with the updated policy network to obtain action data.
2. The method according to claim 1, wherein before changing the update methods of the value network, the policy network, and the target policy network according to the preset divergence-based regularization term, the method further comprises:
constructing a maximization objective function of the multiple agents from the divergence-based regularization term.
3. The method according to claim 2, wherein the divergence-based regularization term is
$$\log\frac{\pi(a_t \mid s_t)}{\rho(a_t \mid s_t)}$$
wherein π represents the policy network, $a_t$ represents the action, $s_t$ represents the state, and ρ represents the target policy network.
4. The method according to claim 1, wherein changing the update method of the value network according to the preset divergence-based regularization term comprises:
the value network is updated according to the maximization objective function:
$$\mathcal{L}(\phi) = \mathbb{E}\Big[\big(y - Q_{\phi}(s,a)\big)^{2}\Big]$$
$$y = r + \gamma\,\mathbb{E}_{a'\sim\pi(\cdot\mid s')}\Big[Q_{\bar{\phi}}(s',a') - \lambda\log\frac{\pi(a'\mid s')}{\rho(a'\mid s')}\Big]$$
$$\bar{\phi} \leftarrow \tau\phi + (1-\tau)\bar{\phi}$$
wherein λ is the coefficient of the regularization term, π represents the policy network, ρ represents the target policy network, φ is a parameter of the value network, $\bar{\phi}$ is a parameter of the target value network, s represents all information in the environment, a represents the action, y represents the target value to be fitted, r represents the global reward, $\mathbb{E}$ represents the mathematical expectation, γ represents the discount factor, s' represents the new state to which the environment transitions after the agents make a decision, a' represents the action made by the agents in the new state, $\mathcal{L}(\phi)$ represents the loss function of the value network, $Q_{\bar{\phi}}$ represents the target value network, $Q_{\phi}$ represents the value network, and τ represents the parameter for the moving-average update of the target value network.
5. The method of claim 1, wherein changing the update mode of the policy network according to the preset divergence-based regularization term comprises:
updating the policy network by minimizing the objective:
L(θ_i) = E_s [ λ D_KL( π_θ_i(·|s) ‖ ρ_i(·|s) ) − E_{a_i∼π_θ_i(·|s)} [ Q^{π,ρ}(s, a_i) ] ]
wherein θ_i represents the parameters of the policy network π_i of agent i, L(θ_i) represents the loss function of the policy network, E represents the mathematical expectation, a_i represents an action, s represents a state, Q^{π,ρ} represents the value function given the policy π and the target policy ρ, λ represents the regularization coefficient, ρ_i represents the target policy network, and D_KL denotes the KL divergence.
6. The method of claim 1, wherein changing the update mode of the target policy network according to the preset divergence-based regularization term comprises:
updating the target policy network as a moving average of the policy network:
θ̄_i ← τ θ_i + (1 − τ) θ̄_i
wherein θ̄_i represents the parameters of the target policy network of agent i, τ is the coefficient of the moving-average update of the target policy network, and θ_i represents the parameters of the policy network of agent i.
7. The method of claim 1, wherein training the plurality of agents according to the value network, the policy network and the target policy network to obtain experience data, and updating the value network, the policy network and the target policy network according to the experience data and the latest update modes comprises:
each agent obtaining observation data from the environment, and making a decision according to the policy network with the aid of the experience data to obtain action data;
the environment giving a global reward according to the current state and the joint action, and transferring to the next state;
storing a tuple as experience data in a cache database, wherein the tuple comprises the current environment state, the current joint action, the next environment state, the observation data and the global reward; and
after an episode ends, acquiring a plurality of pieces of experience data from the cache database for training, and updating the value network, the policy network and the target policy network.
8. A divergence-based multi-agent cooperative learning device, comprising:
an initialization module, used for initializing a value network, a policy network and a target policy network;
a changing module, used for changing the update modes of the value network, the policy network and the target policy network according to a preset divergence-based regularization term to obtain the latest update modes;
a training module, used for training a plurality of agents according to the value network, the policy network and the target policy network to obtain experience data, and for updating the value network, the policy network and the target policy network according to the experience data and the latest update modes; and
an execution module, used for the plurality of agents to acquire observation data from the environment and to make decisions by combining the experience data with the updated policy network to obtain action data.
9. A divergence-based multi-agent cooperative learning apparatus, comprising a processor and a memory storing program instructions, the processor being configured to, upon execution of the program instructions, perform a divergence-based multi-agent cooperative learning method as claimed in any one of claims 1 to 7.
10. A computer readable medium having computer readable instructions stored thereon which are executable by a processor to implement a divergence-based multi-agent cooperative learning method as claimed in any one of claims 1 to 7.
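The claims above specify the divergence-based updates at the level of formulas. For readers who want to relate claims 3 and 4 to executable code, the following is a minimal PyTorch sketch of the divergence-corrected value target and the value-network loss. It assumes discrete joint actions and hypothetical inputs `pi_probs` and `rho_probs` holding the probabilities assigned by the policy network and the target policy network at the next state; none of the class or argument names below come from the patent itself.

```python
# Hedged sketch of claims 3-4: divergence-corrected value target and value loss.
# CentralQ, lambda_coef and the tensor layout are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CentralQ(nn.Module):
    """Centralized value network Q_phi(s, a) over a discrete joint-action space."""
    def __init__(self, state_dim, n_joint_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_joint_actions),
        )

    def forward(self, s):
        return self.net(s)  # shape (batch, n_joint_actions)

def value_loss(q_net, q_target_net, pi_probs, rho_probs,
               s, a, r, s_next, gamma=0.99, lambda_coef=0.01):
    """y = r + gamma * E_{a'~pi}[ Q_target(s', a') - lambda * log(pi(a'|s') / rho(a'|s')) ]."""
    with torch.no_grad():
        q_next = q_target_net(s_next)                                    # (batch, n_joint_actions)
        log_ratio = torch.log(pi_probs + 1e-8) - torch.log(rho_probs + 1e-8)
        # expectation over a' drawn from the current policy pi(.|s')
        v_next = (pi_probs * (q_next - lambda_coef * log_ratio)).sum(dim=-1)
        y = r + gamma * v_next                                           # target value to fit
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)                 # Q_phi(s, a)
    return F.mse_loss(q_sa, y)
```

Here `pi_probs` and `rho_probs` are (batch, n_joint_actions) tensors evaluated at `s_next`; in an actual system they would be produced by the (possibly factored) policy and target policy networks of the agents.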
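Claims 5 and 6, together with the last formula of claim 4, describe the per-agent policy loss and the moving-average updates of the target networks. The sketch below is one possible reading under the same discrete-action assumption; `pi_logits_i`, `rho_logits_i` and `q_values_i` are hypothetical per-agent tensors rather than identifiers taken from the patent, and `q_values_i` is treated as a fixed critic estimate.

```python
# Hedged sketch of claims 5-6: KL-regularized policy loss and moving-average target updates.
import torch
import torch.distributions as D

def policy_loss(pi_logits_i, rho_logits_i, q_values_i, lambda_coef=0.01):
    """L(theta_i) = E_s[ lambda * KL(pi_i(.|s) || rho_i(.|s)) - E_{a_i~pi_i}[ Q(s, a_i) ] ]."""
    pi_i = D.Categorical(logits=pi_logits_i)    # current policy of agent i
    rho_i = D.Categorical(logits=rho_logits_i)  # target policy of agent i
    kl = D.kl_divergence(pi_i, rho_i)           # (batch,)
    expected_q = (pi_i.probs * q_values_i).sum(dim=-1)  # E_{a_i ~ pi_i}[ Q(s, a_i) ]
    return (lambda_coef * kl - expected_q).mean()

@torch.no_grad()
def soft_update(target_net, online_net, tau=0.005):
    """Moving-average (Polyak) update of target parameters, as in the tau-formulas of claims 4 and 6."""
    for p_target, p in zip(target_net.parameters(), online_net.parameters()):
        p_target.mul_(1.0 - tau).add_(tau * p)
```

With λ set to zero this reduces to an ordinary actor-critic update; a larger λ keeps each policy close to its slowly moving target policy, which is the stabilizing role the divergence regularization plays in the claims.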
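Finally, claims 1 and 7 describe how experience is collected and replayed. The skeleton below illustrates that flow; the `env` object with its `reset`/`step` signature and the `agents` list with an `act(observation)` method are Gym-style placeholders assumed for the example, not interfaces defined by the patent.

```python
# Hedged sketch of claims 1 and 7: experience collection into a cache database
# and sampling after an episode. The environment interface is a hypothetical placeholder.
import random
from collections import deque

class ReplayCache:
    """Cache database storing (state, joint_action, next_state, observations, reward) tuples."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, joint_action, next_state, observations, reward):
        self.buffer.append((state, joint_action, next_state, observations, reward))

    def sample(self, batch_size):
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))

def run_episode(env, agents, cache, batch_size=256):
    """One episode: each agent decides from its own observation, the environment returns
    a global reward and the next state, and the transition tuple is cached."""
    state, observations = env.reset()  # assumed to return the global state and per-agent observations
    done = False
    while not done:
        joint_action = [agent.act(obs) for agent, obs in zip(agents, observations)]
        next_state, next_observations, reward, done = env.step(joint_action)
        cache.add(state, joint_action, next_state, observations, reward)
        state, observations = next_state, next_observations
    # after the episode ends, sample experience for the network updates sketched above
    return cache.sample(batch_size)
```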
CN202110315995.0A 2021-03-24 2021-03-24 Divergence-based multi-agent cooperative learning method, divergence-based multi-agent cooperative learning device, divergence-based multi-agent cooperative learning equipment and divergence-based multi-agent cooperative learning medium Active CN113095498B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110315995.0A CN113095498B (en) 2021-03-24 2021-03-24 Divergence-based multi-agent cooperative learning method, divergence-based multi-agent cooperative learning device, divergence-based multi-agent cooperative learning equipment and divergence-based multi-agent cooperative learning medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110315995.0A CN113095498B (en) 2021-03-24 2021-03-24 Divergence-based multi-agent cooperative learning method, divergence-based multi-agent cooperative learning device, divergence-based multi-agent cooperative learning equipment and divergence-based multi-agent cooperative learning medium

Publications (2)

Publication Number Publication Date
CN113095498A (en) 2021-07-09
CN113095498B (en) 2022-11-18

Family

ID=76669465

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110315995.0A Active CN113095498B (en) 2021-03-24 2021-03-24 Divergence-based multi-agent cooperative learning method, divergence-based multi-agent cooperative learning device, divergence-based multi-agent cooperative learning equipment and divergence-based multi-agent cooperative learning medium

Country Status (1)

Country Link
CN (1) CN113095498B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180032863A1 (en) * 2016-07-27 2018-02-01 Google Inc. Training a policy neural network and a value neural network
US20200125957A1 (en) * 2018-10-17 2020-04-23 Peking University Multi-agent cooperation decision-making and training method
CN111514585A (en) * 2020-03-17 2020-08-11 清华大学 Method and system for controlling agent, computer device, and storage medium
CN111612126A (en) * 2020-04-18 2020-09-01 华为技术有限公司 Method and device for reinforcement learning

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
DEMING YUAN et al.: "Distributed Mirror Descent for Online Composite Optimization", IEEE Transactions on Automatic Control *
JIECHUAN JIANG and ZONGQING LU: "Learning Fairness in Multi-Agent Systems", arxiv.org/abs/1910.14472 *
ZHANG-WEI HONG et al.: "A Deep Policy Inference Q-Network for Multi-Agent Systems", Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems *
孙长银 (SUN Changyin) et al.: "多智能体深度强化学习的若干关键科学问题" [Several key scientific issues in multi-agent deep reinforcement learning], 《自动化学报》 (Acta Automatica Sinica) *
曲昭伟 (QU Zhaowei) et al.: "考虑博弈的多智能体强化学习分布式信号控制" [Distributed signal control via multi-agent reinforcement learning considering game behavior], 《交通运输系统工程与信息》 (Journal of Transportation Systems Engineering and Information Technology) *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113391556A (en) * 2021-08-12 2021-09-14 中国科学院自动化研究所 Group distributed control method and device based on role distribution
CN113391556B (en) * 2021-08-12 2021-12-07 中国科学院自动化研究所 Group distributed control method and device based on role distribution
CN113780577A (en) * 2021-09-07 2021-12-10 中国船舶重工集团公司第七0九研究所 Layered decision-making complete cooperation multi-agent reinforcement learning method and system
CN113780577B (en) * 2021-09-07 2023-09-05 中国船舶重工集团公司第七0九研究所 Hierarchical decision complete cooperation multi-agent reinforcement learning method and system
CN114676757A (en) * 2022-03-09 2022-06-28 清华大学 A method and device for generating game strategy of multiplayer incomplete information game
CN114676846A (en) * 2022-03-10 2022-06-28 清华大学 Multi-agent reinforcement learning method and system
CN114418128A (en) * 2022-03-25 2022-04-29 新华三人工智能科技有限公司 Model deployment method and device
CN114418128B (en) * 2022-03-25 2022-07-29 新华三人工智能科技有限公司 Model deployment method and device
CN115494844A (en) * 2022-09-26 2022-12-20 成都朴为科技有限公司 Multi-robot searching method and system
CN115660110A (en) * 2022-12-26 2023-01-31 中国科学院自动化研究所 Multi-agent credit distribution method, device, readable storage medium and intelligent body

Also Published As

Publication number Publication date
CN113095498B (en) 2022-11-18

Similar Documents

Publication Publication Date Title
CN113095498B (en) Divergence-based multi-agent cooperative learning method, divergence-based multi-agent cooperative learning device, divergence-based multi-agent cooperative learning equipment and divergence-based multi-agent cooperative learning medium
Nguyen et al. Deep reinforcement learning for multiagent systems: A review of challenges, solutions, and applications
Christianos et al. Shared experience actor-critic for multi-agent reinforcement learning
Odili et al. Solving the traveling salesman’s problem using the african buffalo optimization
CN113467515B (en) UAV flight control method based on virtual environment imitation reconstruction and reinforcement learning
JP6963511B2 (en) Solution search processing device and solution search processing method
Fu et al. Quantum behaved particle swarm optimization with neighborhood search for numerical optimization
US11144847B1 (en) Reinforcement learning using obfuscated environment models
CN116560239B (en) A multi-agent reinforcement learning method, device and medium
Roy et al. Rough set approach to bi-matrix game
CN116430888A (en) Multi-UAV air combat strategy generation method, device and computer equipment
Jang et al. Combining reward shaping and curriculum learning for training agents with high dimensional continuous action spaces
CN116011762A (en) Intelligent scheduling method and device for a plurality of warehouse unmanned vehicles
Wang et al. Assessing the potential of classical Q-learning in general game playing
CN113887708A (en) Multi-agent learning method based on mean field, storage medium and electronic device
CN118278541A (en) Federated strategy learning method and device based on momentum optimization
US20150150011A1 (en) Self-splitting of workload in parallel computation
Voss et al. Playing a strategy game with knowledge-based reinforcement learning
Guo et al. Novel migration operators of biogeography-based optimization and Markov analysis
CN117875180A (en) Multi-bullet confrontation game strategy learning method and system based on transfer learning
CN115309191B (en) EMARL unmanned aerial vehicle clustering method and device based on competition cooperation mechanism
Ganzfried Computing Nash equilibria in multiplayer DAG-structured stochastic games with persistent imperfect information
Equihua et al. Connectivity conservation planning through deep reinforcement learning
Wang et al. When players affect target values: Modeling and solving dynamic partially observable security games
van Lon et al. Evolved multi-agent systems and thorough evaluation are necessary for scalable logistics (position paper)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant