CN111309880A - Multi-agent action strategy learning method, device, medium and computing device - Google Patents
- Publication number
- CN111309880A (application CN202010072011.6A)
- Authority
- CN
- China
- Legal status: Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
- G06F16/3329—Natural language query formulation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
Description
Technical Field
Embodiments of the present invention relate to the field of reinforcement learning, and more particularly to a multi-agent action policy learning method, apparatus, medium and computing device.
Background
This section is intended to provide a background or context for the embodiments of the invention recited in the claims. The descriptions herein are not admitted to be prior art by inclusion in this section.
The action policy determines the next action an agent should take and plays a crucial role in task-oriented systems. In recent years, policy learning has been widely treated as a reinforcement learning (RL) problem. RL requires a large number of interactions for policy training, and interacting directly with real users is time-consuming and labor-intensive, so the most common approach is to develop a user simulator that helps the target agent learn its action policy during training.
However, designing a reliable user simulator is not easy and is often challenging, because it is essentially equivalent to building a good agent. As the demand for agents that can handle more complex tasks keeps growing, building a fully rule-based user simulator becomes a daunting job that requires a large amount of domain expertise.
Summary of the Invention
In this context, embodiments of the present invention are expected to provide a multi-agent action policy learning method, apparatus, medium and computing device.
In a first aspect of the embodiments of the present invention, a multi-agent action policy learning method is provided, including:
sampling, by each of the multiple agents, a corresponding action according to its own initial action policy;
estimating, for each agent, the advantage obtained after the agent performs the corresponding action; and
updating the action policy of each agent based on the advantages obtained after the agents perform the corresponding actions, so that each updated action policy enables the corresponding agent to obtain a higher return.
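By way of illustration and not limitation, the three steps above could be combined in a loop such as the following minimal sketch; every object in it (env, agents, critics) is a hypothetical placeholder and not part of the claimed implementation.

```python
# Minimal sketch of the claimed loop: sample actions, estimate advantages,
# update each agent's policy. All objects (env, agents, critics) are
# hypothetical placeholders.

def train_step(env, agents, critics, gamma=0.99):
    states = env.current_states()                     # one state per agent
    actions = [ag.sample(s) for ag, s in zip(agents, states)]
    rewards, next_states = env.step(actions)          # per-agent rewards

    advantages = []
    for critic, s, r, s_next in zip(critics, states, rewards, next_states):
        # A(s) = r + gamma * V(s') - V(s)
        advantages.append(r + gamma * critic.value(s_next) - critic.value(s))

    for agent, s, a, adv in zip(agents, states, actions, advantages):
        agent.update_policy(s, a, adv)                # policy-gradient step
```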
In an embodiment of this implementation, before the multiple agents perform the corresponding actions according to their respective initial action policies, the method further includes:
pre-training on real action data to obtain the respective initial action policies of the multiple agents.
In an embodiment of this implementation, a weighted logistic regression method is used to pre-train on real action data to obtain the respective initial action policies of the multiple agents.
In an embodiment of this implementation, estimating the advantages obtained after the agents perform the corresponding actions includes:
obtaining the new state reached by each agent after performing the corresponding action;
computing, for each agent, the return of its current state and the return after reaching the new state; and
computing, for each agent, the advantage obtained after performing the corresponding action from the return of its current state and the return after reaching the new state.
In an embodiment of this implementation, the multiple agents include a user agent and a system agent, the user agent being configured to aim at completing a task and the system agent being configured to assist the user agent in completing the task.
In an embodiment of this implementation, the return of an agent after reaching a given state includes at least its own return and a global return.
In an embodiment of this implementation, the user agent's own return includes at least a penalty for inaction, a penalty for starting a new subtask while an unfinished subtask remains, and a user action reward.
In an embodiment of this implementation, the user action reward is determined based on whether the user agent has performed all actions that help complete the task.
In an embodiment of this implementation, the system agent's own return is accumulated from at least one of the following: a penalty for inaction, a penalty for failing to immediately provide assistance when the user agent requests it, and a subtask completion reward.
In an embodiment of this implementation, the global return is accumulated from at least one of the following: an efficiency loss penalty, a subtask completion reward, and an overall task completion reward.
In an embodiment of this implementation, the user action reward or the overall task completion reward is computed after all actions have terminated.
In an embodiment of this implementation, the return of each agent in a given state is computed by the corresponding value network.
In an embodiment of this implementation, returns of different categories are computed by different value networks.
In an embodiment of this implementation, during the policy learning of the multiple agents, the value network is optimized and updated with the objective of minimizing the variance between the return predicted by a preset method and the return computed by the value network.
In an embodiment of this implementation, the preset method is configured to predict the return based on the state of a given agent.
In an embodiment of this implementation, the preset method is a temporal-difference algorithm.
In an embodiment of this implementation, a target network is introduced when optimizing and updating the value network so that the training process is more stable.
In a second aspect of the embodiments of the present invention, a multi-agent action policy learning apparatus is provided, including multiple agents and multiple value networks, wherein the multiple agents sample corresponding actions according to their respective initial action policies;
the multiple value networks respectively estimate the advantages obtained after the agents perform the corresponding actions; and
the multiple agents update their respective action policies based on the advantages obtained after performing the corresponding actions, so that each updated action policy enables the corresponding agent to obtain a higher return.
In an embodiment of this implementation, the apparatus further includes:
a pre-training module configured to pre-train on real action data to obtain the respective initial action policies of the multiple agents.
In an embodiment of this implementation, the pre-training module is configured to pre-train on real action data using a weighted logistic regression method to obtain the respective initial action policies of the multiple agents.
In an embodiment of this implementation, the apparatus further includes multiple state encoding modules configured to obtain the new state reached by each agent after performing the corresponding action;
the multiple value networks are further configured to compute, for each agent, the return of its current state and the return after reaching the new state, and to compute the advantage obtained by each agent after performing the corresponding action from those returns.
In an embodiment of this implementation, the multiple agents include a user agent and a system agent, the user agent being configured to aim at completing a task and the system agent being configured to assist the user agent in completing the task.
In an embodiment of this implementation, the return of an agent after reaching a given state includes at least its own return and a global return.
In an embodiment of this implementation, the user agent's own return includes at least a penalty for inaction, a penalty for starting a new subtask while an unfinished subtask remains, and a user action reward.
In an embodiment of this implementation, the user action reward is determined based on whether the user agent has performed all actions that help complete the task.
In an embodiment of this implementation, the system agent's own return is accumulated from at least one of the following: a penalty for inaction, a penalty for failing to immediately provide assistance when the user agent requests it, and a subtask completion reward.
In an embodiment of this implementation, the global return is accumulated from at least one of the following: an efficiency loss penalty, a subtask completion reward, and an overall task completion reward.
In an embodiment of this implementation, the user action reward or the overall task completion reward is computed after the actions have terminated.
In an embodiment of this implementation, the return of each agent in a given state is computed by the corresponding value network.
In an embodiment of this implementation, returns of different categories are computed by different value networks.
In an embodiment of this implementation, during the policy learning of the multiple agents, the value network is optimized and updated with the objective of minimizing the variance between the return predicted by a preset method and the return computed by the value network.
In an embodiment of this implementation, the preset method is configured to predict the return based on the state of a given agent.
In an embodiment of this implementation, the preset method is a temporal-difference algorithm.
In an embodiment of this implementation, a target network is introduced when optimizing and updating the value network so that the training process is more stable.
In a third aspect of the embodiments of the present invention, a computer-readable storage medium is provided. The storage medium stores a computer program, and the computer program is used to execute the method of any one of the embodiments of the first aspect.
In a fourth aspect of the embodiments of the present invention, a computing device is provided. The computing device includes a processor configured to implement the method of any one of the embodiments of the first aspect when executing a computer program stored in a memory.
This embodiment proposes a multi-agent action policy learning method. In a task-oriented machine learning scenario, multiple cooperating agents are trained simultaneously (that is, each agent learns its own action policy) instead of having a single agent interact with a pre-built simulator, and no manual supervision is required, which greatly saves time and resources. In addition, in this embodiment, in order for every agent to learn a good action policy, each agent is assigned a different reward; this differentiated reward assignment enables each agent to learn an action policy that yields a higher cumulative return for itself.
Brief Description of the Drawings
The above and other objects, features and advantages of exemplary embodiments of the present invention will become readily understood by reading the following detailed description with reference to the accompanying drawings. In the drawings, several embodiments of the present invention are shown by way of example and not limitation, wherein:
FIG. 1 schematically shows a structural scenario of a multi-agent action policy learning method according to an embodiment of the present invention;
FIG. 2 schematically shows a flow scenario of the interaction between a user and a system according to an embodiment of the present invention;
FIG. 3 is a schematic flowchart of a multi-agent action policy learning method provided by an embodiment of the present invention;
FIG. 4 is a schematic structural scenario diagram of a multi-agent action policy learning method provided by an embodiment of the present invention;
FIG. 5 is a schematic block diagram of a multi-agent action policy learning apparatus provided by an embodiment of the present invention;
FIG. 6 is a schematic diagram of a computer-readable storage medium provided by an embodiment of the present invention;
FIG. 7 is a schematic diagram of a computing device provided by an embodiment of the present invention.
In the drawings, the same or corresponding reference numerals denote the same or corresponding parts.
Detailed Description
The principles and spirit of the present invention will now be described with reference to several exemplary embodiments. It should be understood that these embodiments are given only so that those skilled in the art can better understand and implement the present invention, and not to limit the scope of the present invention in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
As will be appreciated by those skilled in the art, embodiments of the present invention may be implemented as a system, apparatus, device, method or computer program product. Accordingly, the present disclosure may be embodied entirely in hardware, entirely in software (including firmware, resident software, microcode, etc.), or in a combination of hardware and software.
According to embodiments of the present invention, a multi-agent action policy learning method, apparatus, medium and computing device are provided.
Furthermore, any number of elements in the drawings is for illustration rather than limitation, and any naming is for distinction only and carries no limiting meaning.
The principles and spirit of the present invention are explained in detail below with reference to several representative embodiments.
Scenario Overview
In task-oriented scenarios, users usually come with a clear purpose and hope to obtain information or services that satisfy specific constraints, for example ordering a meal, booking tickets, shopping online, hailing a taxi, reserving a hotel, or looking for music, a movie or some product. In such scenarios, the system (agent) is required to provide the corresponding information according to the user's goal. The current practice is usually to let a pre-built user simulator interact with the system (agent) in order to train the system (agent) so that it becomes capable of serving users.
The inventors have found that dialog agents with different roles can be trained simultaneously under the actor-critic framework, so that they learn policies for interacting and cooperating to achieve a goal. In this framework an agent is regarded as an actor that selects an action according to its current state, a critic module scores the actor's performance (i.e., the agent performing a certain action), and the actor adjusts its actions according to the critic's score in order to obtain a higher score.
However, the roles of the different agents are in fact different: the agents are divided into a user agent and a system agent. It is therefore inappropriate to judge the performance of both agents with the same set of criteria; each agent needs to be scored against its own criteria. In other words, critics with different evaluation criteria should score the performance of each actor separately, as shown in FIG. 1, so that each agent can adjust its own action policy and thereby complete the task more efficiently.
Task Overview
An embodiment of the present invention provides a multi-agent action policy learning method in which the multiple agents interact and cooperate to accomplish a task goal. The task goal may be a goal shared by the agents or the goal of one particular agent. The agents may be implemented as any kind of virtual or physical entity capable of carrying out behaviors or actions, for example a robot that can converse, grasp objects or move. The following description takes a dialog between multiple agents as an example. The dialog consists of multiple turns between a user agent (hereinafter simply "user") and a system agent (hereinafter simply "system"). The user agent has a task goal G = (C, R), where C is a set of constraints (e.g., a Japanese restaurant in the city centre) and R is a set of request types (e.g., querying a hotel's address or phone number). The information requested by the user agent is stored in an external database DB that the system agent can access, and the user agent and the system agent interact and cooperate during the dialog to achieve the user goal. G may contain multiple domains, and the two agents must complete all the subtasks in every domain. Both agents can perceive only part of the environment information, as shown in FIG. 2: only the user agent knows the task goal G, and only the system agent can access the DB, so the only way for them to learn each other's information is through the dialog interaction. In the present invention the two agents in the dialog communicate asynchronously, that is, the user agent and the system agent take turns to communicate.
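For concreteness only, a multi-domain user goal G = (C, R) might be represented as a plain dictionary such as the following; the domain and slot names are illustrative examples, not part of the described embodiment.

```python
# Illustrative user goal G = (C, R): constraints C and request types R,
# possibly spanning several domains. Slot names are made-up examples.
user_goal = {
    "restaurant": {
        "constraints": {"food": "japanese", "area": "centre"},   # C
        "requests": ["address", "phone"],                        # R
    },
    "hotel": {
        "constraints": {"stars": "4"},
        "requests": ["postcode"],
    },
}
```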
Exemplary Method
A multi-agent action policy learning method according to an exemplary embodiment of the present invention is described below with reference to FIG. 3 in conjunction with the application scenario of FIG. 1. It should be noted that the above application scenario is shown only to facilitate understanding of the spirit and principle of the present invention, and the embodiments of the present invention are not limited in this respect; rather, the embodiments of the present invention can be applied to any applicable scenario.
FIG. 3 schematically shows an exemplary processing flow 300 of the multi-agent action policy learning method according to an embodiment of the present application. After the processing flow 300 starts, step S310 is executed.
In step S310, the multiple agents sample corresponding actions according to their respective initial action policies.
Each agent has its own state s_i and action a_i. The joint state transition s = (s_1, ..., s_N) → s' = (s'_1, ..., s'_N) depends on the actions (a_1, ..., a_N) taken by all agents according to their respective action policies π_i(a_i | s_i). In this embodiment the user acts first and starts the task: the user first selects the action to be performed based on its current state according to the user policy μ(a^U | s^U), and the system then selects the action to be performed based on its current state according to the system policy π(a^S | s^S). Accordingly, each dialog turn t can be represented by the state-action pairs (s^U_t, a^U_t, s^S_t, a^S_t),
where the superscript U denotes the user, the superscript S denotes the system, and the subscript denotes the turn index.
In an embodiment of this implementation, the initial action policy of each agent may be a random policy, i.e., each agent randomly samples the action to be performed from the set of available actions based on its current state. However, when dealing with complex multi-domain, multi-goal (dialog) tasks, the action space of each agent's action policy may be very large; in that case training from scratch with a random policy would consume an enormous amount of resources and time. Therefore, in an embodiment of this implementation, the training process is divided into two stages: the action policy of each agent is first pre-trained with real action data (e.g., a dialog corpus), and then step S310 is executed to improve the pre-trained policies with RL. Since each agent only generates a few dialog acts in one turn, in an embodiment of this implementation a β-weighted logistic regression is used for policy pre-training to reduce the data sample bias:
L(X, Y; β) = −[β · Y^T log σ(X) + (I − Y)^T log(I − σ(X))],
where X is the state and Y is the corresponding action from the corpus for this task.
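A minimal numpy sketch of this β-weighted loss, assuming X holds the raw policy logits for one state and Y is the multi-hot action label vector; it is an illustration of the formula above, not a reference implementation.

```python
import numpy as np

def weighted_lr_loss(logits, labels, beta):
    """beta-weighted logistic regression loss:
    L(X, Y; beta) = -[beta * Y^T log(sigmoid(X)) + (1 - Y)^T log(1 - sigmoid(X))].
    logits: raw policy outputs X; labels: multi-hot action vector Y."""
    p = 1.0 / (1.0 + np.exp(-logits))          # sigma(X)
    eps = 1e-8                                 # numerical stability
    return -(beta * labels @ np.log(p + eps)
             + (1 - labels) @ np.log(1 - p + eps))
```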
In addition, the user and the system may engage in multiple dialog turns, where each turn consists of the system utterance and the user utterance of that round.
For example, let S(t) denote the system utterance of the t-th turn and U(t) denote the user utterance of the t-th turn, where t is the turn index, t = 1, 2, 3, .... Thus S(1) denotes the system utterance of the first turn, U(1) denotes the user utterance of the first turn, and so on.
It should be noted that in each dialog turn the system speaks after the user; that is, within a single turn the user first issues a query and the system then gives the corresponding response.
In this embodiment, the system's action policy π decides, according to the system state s^S, the action a^S that the system will perform, so as to give the user an appropriate response. Each system action a^S is a subset of the action set A. In this embodiment, the system state s^S_t of the t-th turn consists of: (I) the user action a^U_t of the current turn; (II) the system action a^S_{t−1} of the previous turn; (III) the belief state b_t tracking the constraint slots and request slots provided by the user; and (IV) the embedding vector q_t of the number of query results returned from the DB.
The user policy μ decides, according to the user state s^U, the action a^U that the user will perform, so as to convey the user's constraints and requests to the system. In this embodiment, the user state s^U_t consists of: (I) the system action a^S_{t−1} of the previous turn; (II) the user action a^U_{t−1} of the previous turn; (III) the goal state g_t, which represents the remaining constraints and request types still to be sent; and (IV) the inconsistency vector c_t, which indicates the difference between the system response and the constraints C.
A user action is an abstract representation of an intent (constraints and request types) and can be represented as a quadruple of domain, intent type, slot type and slot value (e.g., [restaurant, inform, food, Italian]). The above example expresses that the user wants the system to provide information about restaurants serving Italian food. Note that there may be multiple intents within one turn.
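As a toy encoding of this quadruple representation (the second act and its "?" placeholder value are assumptions added for illustration, not taken from the text):

```python
# One user dialog act: (domain, intent type, slot type, slot value).
# A single turn may carry several such acts.
dialog_act = ("restaurant", "inform", "food", "italian")
turn_acts = [dialog_act, ("restaurant", "request", "address", "?")]
```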
In addition to predicting user actions, the user policy is also configured to output an action termination signal T, i.e., μ = μ(a^U, T | s^U). That is, the termination signal T is output when the user has no remaining tasks to perform.
After the current state of each agent and the action it is about to perform have been obtained, the new state after the action is performed can be estimated, and the advantage obtained by performing the action can then be computed in order to update each agent's action policy.
As an example, during the dialog process the user and the system have already begun conversing; in order to update the action policy of each agent, step S320 is executed next.
In step S320, the advantage obtained by each agent after performing its corresponding action is estimated.
Specifically, the new state reached by each agent after performing its corresponding action is first obtained;
then the return of each agent's current state and the return after reaching the new state are computed;
finally, the advantage obtained by each agent after performing its corresponding action is computed from the return of its current state and the return after reaching the new state.
The return of an agent's current state is the accumulation of the rewards it receives. Specifically, during reinforcement learning the environment feeds back a reward value to the agent after each of its actions. As an example, the expected return of an agent at dialog turn t is R_t = Σ_{t'≥t} γ^{t'−t} r_{t'}, where r is the reward fed back by the environment after the agent selects an action and γ is the discount factor.
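A small helper that computes this discounted cumulative return for every turn of a reward sequence; a sketch of the standard computation rather than the claimed implementation.

```python
def discounted_return(rewards, gamma=0.99):
    """R_t = sum_{t' >= t} gamma^(t' - t) * r_{t'} for every turn t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns
```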
In traditional reinforcement learning there is usually only one agent whose policy needs to be learned, i.e., the environment only needs to feed back a reward to that single agent. Here, on the one hand, the roles of the user and the system are not the same: the user actively initiates the task and may change it during the dialog, whereas the system only passively responds to the user and returns appropriate information. The performance of the user and the system therefore cannot be judged entirely by a single uniform standard, and rewards should be computed separately for each agent. In an embodiment of this implementation, the system reward r^S_t consists of at least the following parts:
(I) a penalty for inaction, e.g., a penalty when the system gives no response in a dialog turn;
(II) a penalty for failing to immediately provide assistance when the user agent requests it, e.g., a penalty when the user sends a request in a turn but the system does not reply with the corresponding information in time;
(III) a subtask completion reward, i.e., a reward for helping the user complete a subtask according to the request the user sent. For example, if in a turn the user requests a piece of information and the system gives an appropriate response, the system receives the corresponding reward.
The user reward r^U_t consists of at least the following parts:
(I) a penalty for inaction, similar to the system reward;
(II) a penalty for starting a new subtask while an unfinished subtask remains, i.e., a penalty is given if the user requests new information while there is still a constraint to be communicated to the system;
(III) a user action reward, determined based on whether the user agent performs all actions that help complete the task; for example, in this embodiment it is determined by whether the user has informed the system of all constraints C and request types R.
With the above scheme, the system reward and the user reward in a dialog turn can be determined unambiguously from the state of each agent.
On the other hand, however, the two agents communicate and cooperate to complete the same task, so the reward should also involve the (common) global goal of both agents.
Based on this, in an embodiment of this implementation, the global reward r^G_t consists of at least the following parts:
(I) an efficiency loss (penalty), e.g., a penalty within a preset range (such as a small negative value) is given for every dialog turn;
(II) a subtask completion reward, e.g., a reward given when a subtask in one domain of the user's overall goal G is completed;
(III) an overall task completion reward, e.g., a reward for completing all tasks of the user's overall goal G.
The detailed decomposition of the rewards above is specific to this embodiment; when the method of the present application is used for policy learning in other fields, the settings and computations need not follow exactly the above scheme. In general, the above embodiments merely explain how to compute a return separately for each agent (i.e., independently compute and accumulate rewards) and, for multi-agent policy learning that requires cooperation, how to additionally compute a global return (accumulated global rewards) on top of the per-agent returns and distribute it to the corresponding agents in a preset manner, for example by distributing the global return to the agents according to preset proportions or weights.
In this embodiment the return of an agent consists of its own return and the global return. For the system, for example, its total return is the accumulated system reward plus the accumulated global reward, R^S = Σ_t γ^t (r^S_t + r^G_t), where the superscript G denotes the global part. Note that the overall task success reward and the user action reward are only computed after all tasks have ended, and the subtask completion reward counted in the system reward is different from the subtask completion reward in the global reward.
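Purely as an illustration of this decomposition, the per-turn rewards could be assembled as follows; all penalty and bonus magnitudes are arbitrary example values, not values specified by the embodiment.

```python
def turn_rewards(sys_acted, answered_request, pending_constraint,
                 user_new_subtask, subtask_done):
    """Toy per-turn reward decomposition; all magnitudes are arbitrary examples."""
    r_sys, r_user = 0.0, 0.0
    r_global = -0.1                      # efficiency penalty every turn
    if not sys_acted:
        r_sys -= 0.5                     # system inaction penalty
    if not answered_request:
        r_sys -= 0.5                     # did not assist the requesting user
    if user_new_subtask and pending_constraint:
        r_user -= 0.5                    # new subtask while one is unfinished
    if subtask_done:
        r_sys += 1.0                     # system-side subtask reward
        r_global += 1.0                  # global subtask reward (kept separate)
    return r_sys, r_user, r_global
```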
The action policy of each agent aims to maximize the return obtained by its corresponding agent. In order for each agent to learn an optimal action policy, step S330 is executed next. In step S330, the action policy of each agent is updated based on the advantages obtained after the agents perform the corresponding actions, so that each updated action policy enables the corresponding agent to obtain a higher return. Specifically, in an embodiment of this implementation, an agent's action policy is updated based on the advantage of the agent after it performs a certain action, and the advantage is computed from the return, i.e., A(s) = r + γV(s') − V(s). After the advantages of the different components have been obtained, the action policy of an agent can be updated based on the component advantages relevant to it. As an example, the system policy and the user policy can be updated with the policy gradients ∇_φ J ≈ E[∇_φ log π_φ(a^S | s^S) · A^S(s)] and ∇_w J ≈ E[∇_w log μ_w(a^U, T | s^U) · A^U(s)], respectively.
In this way the policy gradients of the user and of the system are obtained separately, and the action policies can then be updated accordingly to obtain the optimal action policy of each agent. In this embodiment the system policy π_φ is parameterized by φ and the user policy μ_w is parameterized by w, i.e., the parameters φ and w are updated based on the above policy gradients, respectively.
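A minimal PyTorch sketch of one such actor update, assuming the policy outputs independent Bernoulli logits over dialog acts (several acts may be chosen per turn); this is one common parameterization, not necessarily the exact one of this embodiment.

```python
import torch

def policy_gradient_step(policy, optimizer, state, action, advantage):
    """One actor update: ascend on log pi(a|s) * A(s).
    policy(state) is assumed to return per-action Bernoulli logits;
    action is the 0/1 vector of acts actually taken."""
    logits = policy(state)
    dist = torch.distributions.Bernoulli(logits=logits)
    log_prob = dist.log_prob(action).sum()
    loss = -(log_prob * advantage.detach())   # negate for gradient ascent
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```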
The above embodiment describes how each agent's action policy is updated, namely according to the advantage of the agent performing a certain action in a certain state. From this it can be seen that the computation of the advantage is equally important: an appropriate advantage computation method helps an agent learn a better action policy. Specifically, in an embodiment of this implementation, a value network (i.e., the critic described above) is set up for each different agent; in addition, in order to fully accumulate the rewards of the different agents, a separate value network (critic) is set up to estimate the global return. In summary, as shown in FIG. 4, in this embodiment a hybrid value network (HVN) comprising three independent value networks (critics) is set up, and each of them accumulates the corresponding component of the reward into a return and computes the corresponding advantage. Building on the above embodiment, the value networks (critics) of this embodiment accumulate returns based on the state of each agent. Concretely, in this embodiment the dialog state of each agent is first encoded to learn a state representation: h^S_s = tanh(f(s^S)) and h^U_s = tanh(f(s^U)),
where f(·) can be any neural network (e.g., a multilayer perceptron or a convolutional neural network; the networks for the two agents may be the same or different, which is not limited in this implementation) and tanh is the hyperbolic tangent (activation) function. The return of each agent can then be computed from its dialog state by accumulating the corresponding rewards: V^S(s^S) = f_S(h^S_s), V^U(s^U) = f_U(h^U_s), and V^G(s) = f_G([h^S_s; h^U_s]),
where f_S, f_U and f_G are three arbitrary neural networks.
It should be noted that the neural networks described in the above embodiments may be the same or different, which is not limited in this implementation.
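The hybrid value network described above could be sketched in PyTorch as follows; the layer types and sizes are arbitrary choices made for the example, since the text allows any neural networks for the encoders and heads.

```python
import torch
import torch.nn as nn

class HybridValueNetwork(nn.Module):
    """Sketch of an HVN: per-agent encoders plus three value heads V_S, V_U, V_G.
    Layer sizes and the single-linear-layer heads are example choices."""
    def __init__(self, sys_dim, usr_dim, hidden=128):
        super().__init__()
        self.enc_sys = nn.Linear(sys_dim, hidden)    # f(.) for the system state
        self.enc_usr = nn.Linear(usr_dim, hidden)    # f(.) for the user state
        self.v_sys = nn.Linear(hidden, 1)            # f_S
        self.v_usr = nn.Linear(hidden, 1)            # f_U
        self.v_glob = nn.Linear(2 * hidden, 1)       # f_G on both encodings

    def forward(self, s_sys, s_usr):
        h_sys = torch.tanh(self.enc_sys(s_sys))
        h_usr = torch.tanh(self.enc_usr(s_usr))
        v_s = self.v_sys(h_sys)
        v_u = self.v_usr(h_usr)
        v_g = self.v_glob(torch.cat([h_sys, h_usr], dim=-1))
        return v_s, v_u, v_g
```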
In an implementation of the present application, the advantage computation itself is also continuously updated and optimized during training. Specifically, during the policy learning of the multiple agents, the value network is optimized and updated with the objective of minimizing the variance between the return predicted by a preset method and the return computed by the value network. The preset method may be any known way of predicting the return from an agent's state, for example a temporal-difference algorithm. Because the value network is updated frequently, the estimated values may change too much, which is especially problematic for the multi-agent policy learning of this embodiment. Therefore, in an embodiment of this implementation, a target network is introduced when optimizing and updating the value network so that the training process is more stable. Concretely, the following losses can be optimized separately to update the value network: L^S_V(θ) = E[(r^S + γV^S_{θ⁻}(s'^S) − V^S_θ(s^S))^2], L^U_V(θ) = E[(r^U + γV^U_{θ⁻}(s'^U) − V^U_θ(s^U))^2], and L^G_V(θ) = E[(r^G + γV^G_{θ⁻}(s') − V^G_θ(s))^2],
where the HVN V_θ is parameterized by θ, θ⁻ denotes the weights of the target network, and the total loss L_V is the sum of the estimated return losses of the component rewards, L_V = L^S_V + L^U_V + L^G_V.
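Continuing the earlier HVN sketch, the total critic loss with a frozen target network could look like the following; the batch keys are invented names used only for this illustration.

```python
import torch
import torch.nn.functional as F

def hvn_loss(hvn, hvn_target, batch, gamma=0.99):
    """Total critic loss L_V: sum of squared TD errors for the system, user and
    global returns, with bootstrapped targets from a frozen target network."""
    v_s, v_u, v_g = hvn(batch["s_sys"], batch["s_usr"])
    with torch.no_grad():
        tv_s, tv_u, tv_g = hvn_target(batch["next_s_sys"], batch["next_s_usr"])
        y_s = batch["r_sys"] + gamma * tv_s
        y_u = batch["r_usr"] + gamma * tv_u
        y_g = batch["r_glob"] + gamma * tv_g
    return F.mse_loss(v_s, y_s) + F.mse_loss(v_u, y_u) + F.mse_loss(v_g, y_g)
```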
This embodiment proposes a multi-agent action policy learning method. In a task-oriented machine learning scenario, multiple cooperating agents are trained simultaneously (that is, each agent learns its own action policy) instead of having a single agent interact with a pre-built simulator, and no manual supervision is required, which greatly saves time and resources. In addition, in this embodiment, in order for each agent to learn a good action policy, a hybrid value network is built that splits the total reward among the individual agents and the task, so that the performance of each agent is judged by a different value network; this differentiated reward assignment enables each agent to learn an action policy that yields a higher cumulative return for itself.
Exemplary Apparatus
Having introduced the method of the exemplary embodiment of the present invention, a multi-agent action policy learning apparatus of the exemplary embodiment of the present invention is described next with reference to FIG. 5. The apparatus includes multiple agents and multiple value networks, wherein the multiple agents sample corresponding actions according to their respective initial action policies;
the multiple value networks respectively estimate the advantages obtained after the agents perform the corresponding actions; and
the multiple agents update their respective action policies based on the advantages obtained after performing the corresponding actions, so that each updated action policy enables the corresponding agent to obtain a higher return.
It should be noted that the number of agents and the number of value networks are not fixed and can be set according to the actual scenario. For example, if several agents play the same role (so that they can share one value network), it is not necessary to set up a value network for every single agent. In addition to the different value networks required for agents with different roles, an extra value network for computing the global return can also be provided. These value networks can be combined to form a hybrid value network.
In an embodiment of this implementation, the apparatus further includes:
a pre-training module configured to pre-train on real action data to obtain the respective initial action policies of the multiple agents.
In an embodiment of this implementation, the pre-training module is configured to pre-train on real action data using a weighted logistic regression method to obtain the respective initial action policies of the multiple agents.
In an embodiment of this implementation, the apparatus further includes multiple state encoding modules configured to obtain the new state reached by each agent after performing the corresponding action;
the multiple value networks are further configured to compute, for each agent, the return of its current state and the return after reaching the new state, and to compute the advantage obtained by each agent after performing the corresponding action from those returns.
In an embodiment of this implementation, the multiple agents include a user agent and a system agent, the user agent being configured to aim at completing a task and the system agent being configured to assist the user agent in completing the task.
In an embodiment of this implementation, the return of an agent after reaching a given state includes at least its own return and a global return.
In an embodiment of this implementation, the user agent's own return includes at least a penalty for inaction, a penalty for starting a new subtask while an unfinished subtask remains, and a user action reward.
In an embodiment of this implementation, the user action reward is determined based on whether the user agent has performed all actions that help complete the task.
In an embodiment of this implementation, the system agent's own return is accumulated from at least one of the following: a penalty for inaction, a penalty for failing to immediately provide assistance when the user agent requests it, and a subtask completion reward.
In an embodiment of this implementation, the global return is accumulated from at least one of the following: an efficiency loss penalty, a subtask completion reward, and an overall task completion reward.
In an embodiment of this implementation, the user action reward or the overall task completion reward is computed after the actions have terminated.
In an embodiment of this implementation, the return of each agent in a given state is computed by the corresponding value network.
In an embodiment of this implementation, returns of different categories are computed by different value networks.
In an embodiment of this implementation, during the policy learning of the multiple agents, the value network is optimized and updated with the objective of minimizing the variance between the return predicted by a preset method and the return computed by the value network.
In an embodiment of this implementation, the preset method is configured to predict the return based on the state of a given agent.
In an embodiment of this implementation, the preset method is a temporal-difference algorithm.
In an embodiment of this implementation, a target network is introduced when optimizing and updating the value network so that the training process is more stable.
Exemplary Medium
Having introduced the method and apparatus of the exemplary embodiments of the present invention, a computer-readable storage medium of an exemplary embodiment of the present invention is described next with reference to FIG. 6.
Referring to FIG. 6, the computer-readable storage medium shown is an optical disc 60 on which a computer program (i.e., a program product) is stored. When run by a processor, the computer program implements the steps recorded in the above method embodiments, for example: the multiple agents sample corresponding actions according to their respective initial action policies; the advantage obtained by each agent after performing the corresponding action is estimated; and the action policy of each agent is updated based on the advantages obtained after the agents perform the corresponding actions, so that each updated action policy enables the corresponding agent to obtain a higher return. The specific implementation of each step is not repeated here.
It should be noted that examples of the computer-readable storage medium also include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory and other optical or magnetic storage media, which are not enumerated here one by one.
Exemplary Computing Device
Having introduced the method, apparatus and medium of the exemplary embodiments of the present invention, a computing device of an exemplary embodiment of the present invention is described next with reference to FIG. 7. FIG. 7 shows a block diagram of an exemplary computing device 70 suitable for implementing embodiments of the present invention; the computing device 70 may be a computer system or a server. The computing device 70 shown in FIG. 7 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present invention.
As shown in FIG. 7, the components of the computing device 70 may include, but are not limited to: one or more processors or processing units 701, a system memory 702, and a bus 703 connecting the different system components (including the system memory 702 and the processing unit 701).
The computing device 70 typically includes a variety of computer-system-readable media. These media may be any available media that can be accessed by the computing device 70, including volatile and non-volatile media and removable and non-removable media.
The system memory 702 may include computer-system-readable media in the form of volatile memory, such as a random access memory (RAM) 7021 and/or a cache memory 7022. The computing device 70 may further include other removable/non-removable, volatile/non-volatile computer-system storage media. By way of example only, a ROM 7023 may be used to read from and write to a non-removable, non-volatile magnetic medium (not shown in FIG. 7, commonly referred to as a "hard disk drive"). Although not shown in FIG. 7, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (e.g., a "floppy disk") and an optical disc drive for reading from and writing to a removable non-volatile optical disc (e.g., a CD-ROM, DVD-ROM or other optical media) may be provided. In these cases, each drive may be connected to the bus 703 through one or more data-medium interfaces. The system memory 702 may include at least one program product having a set of (e.g., at least one) program modules configured to perform the functions of the embodiments of the present invention.
A program/utility 7025 having a set of (at least one) program modules 7024 may be stored, for example, in the system memory 702. Such program modules 7024 include, but are not limited to, an operating system, one or more application programs, other program modules and program data, and each or some combination of these examples may include an implementation of a network environment. The program modules 7024 generally perform the functions and/or methods of the embodiments described in the present invention.
The computing device 70 may also communicate with one or more external devices 704 (such as a keyboard, a pointing device, a display, etc.). Such communication may take place through input/output (I/O) interfaces 705. Moreover, the computing device 70 may also communicate with one or more networks (for example, a local area network (LAN), a wide area network (WAN) and/or a public network such as the Internet) through a network adapter 706. As shown in FIG. 7, the network adapter 706 communicates with the other modules of the computing device 70 (such as the processing unit 701) through the bus 703. It should be understood that, although not shown in FIG. 7, other hardware and/or software modules may be used in conjunction with the computing device 70.
The processing unit 701 executes various functional applications and data processing by running programs stored in the system memory 702, for example: the multiple agents sample corresponding actions according to their respective initial action policies; the advantage obtained by each agent after performing the corresponding action is estimated; and the action policy of each agent is updated based on the advantages obtained after the agents perform the corresponding actions, so that each updated action policy enables the corresponding agent to obtain a higher return.
应当注意,尽管在上文详细描述中提及了基于时间序列的异常数据检测装置的若干单元/模块或子单元/模块,但是这种划分仅仅是示例性的并非强制性的。实际上,根据本发明的实施方式,上文描述的两个或更多单元/模块的特征和功能可以在一个单元/模块中具体化。反之,上文描述的一个单元/模块的特征和功能可以进一步划分为由多个单元/模块来具体化。It should be noted that although several units/modules or sub-units/modules of the time-series based abnormal data detection apparatus are mentioned in the above detailed description, this division is merely exemplary and not mandatory. Indeed, the features and functions of two or more units/modules described above may be embodied in one unit/module according to embodiments of the present invention. Conversely, the features and functions of one unit/module described above may be further subdivided to be embodied by multiple units/modules.
Furthermore, although the operations of the methods of the present invention are depicted in the drawings in a particular order, this does not require or imply that these operations must be performed in that particular order, or that all of the illustrated operations must be performed, in order to achieve the desired results. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined and performed as one step, and/or one step may be split into multiple steps.
Although the spirit and principles of the present invention have been described with reference to several specific embodiments, it should be understood that the invention is not limited to the specific embodiments disclosed, and the division into aspects does not mean that features of these aspects cannot be combined to advantage; this division is made only for convenience of presentation. The invention is intended to cover the various modifications and equivalent arrangements falling within the spirit and scope of the appended claims.
Through the above description, the embodiments of the present invention provide the following technical solutions, but are not limited thereto:
1. A multi-agent action strategy learning method, comprising:
the multiple agents respectively sampling corresponding actions according to their respective initial action strategies;
separately estimating the advantages obtained by the multiple agents after performing the corresponding actions; and
updating the action strategy of each agent based on the advantages obtained after the multiple agents perform the corresponding actions, so that each updated action strategy enables the corresponding agent to obtain a higher reward.
2. The method of solution 1, wherein, before the multiple agents perform the corresponding actions according to their respective initial action strategies, the method further comprises:
pre-training on real action data to obtain the respective initial action strategies of the multiple agents.
3. The method of solution 2, wherein weighted logistic regression is used to pre-train on the real action data to obtain the respective initial action strategies of the multiple agents (an illustrative sketch of such pre-training follows this list of solutions).
4. The method of solution 1, wherein separately estimating the advantages obtained by the multiple agents after performing the corresponding actions comprises:
separately acquiring the new state reached by each agent after performing the corresponding action;
separately calculating the return of each agent's current state and of the new state reached; and
separately calculating, from the return of each agent's current state and of the new state, the advantage obtained by each agent after performing the corresponding action.
5. The method of solution 1, wherein the multiple agents include a user agent and a system agent, the user agent being configured to aim at completing a task, and the system agent being configured to assist the user agent in completing the task.
6. The method of solution 5, wherein the return of each agent after reaching a certain state includes at least its own reward and a global reward.
7. The method of solution 6, wherein the user agent's own reward includes at least a penalty for inaction, a penalty for starting a new subtask while an unfinished subtask exists, and a user action reward.
8. The method of solution 7, wherein the user action reward is determined based on whether the user agent has performed all actions conducive to completing the task.
9. The method of solution 6, wherein the system agent's own reward is accumulated from at least one of the following: a penalty for inaction, a penalty for not immediately providing the corresponding assistance when the user agent requests assistance, and a subtask completion reward.
10. The method of solution 6, wherein the global reward is accumulated from at least one of the following: an efficiency loss penalty, a subtask completion reward, and an overall task completion reward (an illustrative sketch of this reward composition follows this list of solutions).
11. The method of solution 8 or 10, wherein the user action reward or the overall task completion reward is calculated after all actions have terminated.
12. The method of solution 3, wherein the return of each agent in a certain state is calculated by a corresponding value network.
13. The method of solution 12, wherein different categories of returns are calculated by different value networks, respectively.
14. The method of solution 13, wherein, during the strategy learning of the multiple agents, the value network is configured to be optimized and updated with the objective of minimizing the variance between the return predicted by a preset method and the return calculated by the value network.
15. The method of solution 14, wherein the preset method is configured to predict the return based on the state of an agent.
16. The method of solution 15, wherein the preset method is a temporal-difference algorithm.
17. The method of solution 14, wherein a target network is introduced to optimize and update the value network so that the training process is more stable.
18. A multi-agent action strategy learning apparatus, comprising multiple agents and multiple value networks, wherein the multiple agents respectively sample corresponding actions according to their respective initial action strategies;
the multiple value networks separately estimate the advantages obtained by the multiple agents after performing the corresponding actions; and
the multiple agents update their respective action strategies based on the advantages obtained after performing the corresponding actions, so that each updated action strategy enables the corresponding agent to obtain a higher reward.
19. The apparatus of solution 18, wherein the apparatus further comprises:
a pre-training module configured to pre-train on real action data to obtain the respective initial action strategies of the multiple agents.
20. The apparatus of solution 19, wherein the pre-training module is configured to use weighted logistic regression to pre-train on the real action data to obtain the respective initial action strategies of the multiple agents.
21. The apparatus of solution 18, wherein the apparatus further comprises multiple state encoding modules configured to separately acquire the new state reached by each agent after performing the corresponding action; and
the multiple value networks are further configured to separately calculate the return of each agent's current state and of the new state reached, and to separately calculate, from these returns, the advantage obtained by each agent after performing the corresponding action.
22. The apparatus of solution 18, wherein the multiple agents include a user agent and a system agent, the user agent being configured to aim at completing a task, and the system agent being configured to assist the user agent in completing the task.
23. The apparatus of solution 22, wherein the return of each agent after reaching a certain state includes at least its own reward and a global reward.
24. The apparatus of solution 23, wherein the user agent's own reward includes at least a penalty for inaction, a penalty for starting a new subtask while an unfinished subtask exists, and a user action reward.
25. The apparatus of solution 24, wherein the user action reward is determined based on whether the user agent has performed all actions conducive to completing the task.
26. The apparatus of solution 23, wherein the system agent's own reward is accumulated from at least one of the following: a penalty for inaction, a penalty for not immediately providing the corresponding assistance when the user agent requests assistance, and a subtask completion reward.
27. The apparatus of solution 23, wherein the global reward is accumulated from at least one of the following: an efficiency loss penalty, a subtask completion reward, and an overall task completion reward.
28. The apparatus of solution 25 or 27, wherein the user action reward or the overall task completion reward is calculated after the actions have terminated.
29. The apparatus of solution 20, wherein the return of each agent in a certain state is calculated by a corresponding value network.
30. The apparatus of solution 29, wherein different categories of returns are calculated by different value networks, respectively.
31. The apparatus of solution 30, wherein, during the strategy learning of the multiple agents, the value network is configured to be optimized and updated with the objective of minimizing the variance between the return predicted by a preset method and the return calculated by the value network.
32. The apparatus of solution 31, wherein the preset method is configured to predict the return based on the state of an agent.
33. The apparatus of solution 32, wherein the preset method is a temporal-difference algorithm.
34. The apparatus of solution 31, wherein a target network is introduced to optimize and update the value network so that the training process is more stable.
35. A medium on which a computer program is stored, wherein, when the computer program is executed by a processor, the method of any one of solutions 1-17 is implemented.
36. A computing device, comprising a processor configured to implement the method of any one of solutions 1-17 when executing a computer program stored in a memory.
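As noted in solution 3 above, an initial action strategy can be pre-trained from real action data with weighted logistic regression. The following is a minimal sketch under assumed data: the feature dimension, the number of candidate actions, and the randomly generated stand-in dataset are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in "real action data": state features, the action actually taken, and a per-example
# weight (for instance, emphasizing successful dialogues). All generated here purely for illustration.
rng = np.random.default_rng(0)
states = rng.random((1000, 32))
actions = rng.integers(0, 8, size=1000)
weights = rng.random(1000) + 0.5

# Weighted (multinomial) logistic regression serving as the initial action strategy.
initial_policy = LogisticRegression(max_iter=1000)
initial_policy.fit(states, actions, sample_weight=weights)

def sample_action(state):
    # Sample from the predicted action distribution rather than taking the argmax,
    # so the pre-trained strategy still explores during later multi-agent learning.
    probs = initial_policy.predict_proba(state.reshape(1, -1))[0]
    return rng.choice(len(probs), p=probs)

print(sample_action(rng.random(32)))
```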
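As noted in solution 10 above, the following is a minimal sketch of how the user, system, and global rewards of solutions 6-11 might be composed for a single time step; the field names of the step record and every numeric magnitude are illustrative assumptions, not values prescribed by the embodiments.

```python
# Hypothetical reward shaping consistent with solutions 6-11; all magnitudes are assumptions.
def user_reward(step):
    r = 0.0
    if step["user_no_action"]:
        r -= 1.0                                   # penalty for inaction
    if step["new_subtask_while_pending"]:
        r -= 2.0                                   # penalty for starting a new subtask early
    if step["episode_over"] and step["user_took_all_useful_actions"]:
        r += 5.0                                   # user action reward, settled after all actions end
    return r

def system_reward(step):
    r = 0.0
    if step["system_no_action"]:
        r -= 1.0                                   # penalty for inaction
    if step["assist_requested"] and not step["assist_provided_now"]:
        r -= 2.0                                   # penalty for not assisting immediately
    r += 3.0 * step["subtasks_completed_now"]      # subtask completion reward
    return r

def global_reward(step):
    r = -0.1                                       # per-step efficiency-loss penalty
    r += 3.0 * step["subtasks_completed_now"]      # subtask completion reward
    if step["episode_over"] and step["all_tasks_done"]:
        r += 10.0                                  # overall task completion reward at termination
    return r

def total_reward(step, own_reward_fn):
    # Per solution 6, each agent's return combines its own reward with the shared global reward.
    return own_reward_fn(step) + global_reward(step)
```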
Claims (10)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010072011.6A CN111309880B (en) | 2020-01-21 | 2020-01-21 | Multi-agent action strategy learning method, device, medium and computing equipment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010072011.6A CN111309880B (en) | 2020-01-21 | 2020-01-21 | Multi-agent action strategy learning method, device, medium and computing equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111309880A true CN111309880A (en) | 2020-06-19 |
CN111309880B CN111309880B (en) | 2023-11-10 |
Family
ID=71158208
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010072011.6A Active CN111309880B (en) | 2020-01-21 | 2020-01-21 | Multi-agent action strategy learning method, device, medium and computing equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111309880B (en) |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190232489A1 (en) * | 2016-10-10 | 2019-08-01 | Deepmind Technologies Limited | Neural networks for selecting actions to be performed by a robotic agent |
WO2019207826A1 (en) * | 2018-04-26 | 2019-10-31 | 株式会社Daisy | Device, method, and program for information processing performed by multi-agent |
CN108600379A (en) * | 2018-04-28 | 2018-09-28 | 中国科学院软件研究所 | A kind of isomery multiple agent Collaborative Decision Making Method based on depth deterministic policy gradient |
CN109670270A (en) * | 2019-01-11 | 2019-04-23 | 山东师范大学 | Crowd evacuation emulation method and system based on the study of multiple agent deeply |
CN110141867A (en) * | 2019-04-23 | 2019-08-20 | 广州多益网络股份有限公司 | A kind of game intelligence body training method and device |
CN110427006A (en) * | 2019-08-22 | 2019-11-08 | 齐鲁工业大学 | A kind of multi-agent cooperative control system and method for process industry |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111723189A (en) * | 2020-06-23 | 2020-09-29 | 贝壳技术有限公司 | Interactive question and answer prompting method and device, storage medium and electronic equipment |
CN112036578A (en) * | 2020-09-01 | 2020-12-04 | 成都数字天空科技有限公司 | Intelligent agent training method and device, storage medium and electronic equipment |
CN112036578B (en) * | 2020-09-01 | 2023-06-27 | 成都数字天空科技有限公司 | Intelligent body training method and device, storage medium and electronic equipment |
CN114627981A (en) * | 2020-12-14 | 2022-06-14 | 阿里巴巴集团控股有限公司 | Method and apparatus for generating molecular structure of compound, and nonvolatile storage medium |
CN112507104A (en) * | 2020-12-18 | 2021-03-16 | 北京百度网讯科技有限公司 | Dialog system acquisition method, apparatus, storage medium and computer program product |
CN112507104B (en) * | 2020-12-18 | 2022-07-22 | 北京百度网讯科技有限公司 | Dialog system acquisition method, apparatus, storage medium and computer program product |
CN113268352A (en) * | 2021-06-11 | 2021-08-17 | 中科院软件研究所南京软件技术研究院 | Multi-instruction response type task collaborative management method facing general service robot |
CN113268352B (en) * | 2021-06-11 | 2024-03-08 | 中科院软件研究所南京软件技术研究院 | Multi-instruction responsive task collaborative management method for universal service robot |
CN117573946A (en) * | 2023-09-21 | 2024-02-20 | 北京百度网讯科技有限公司 | Dialogue sample generation method, chat dialogue large model training method and related devices |
CN117709806A (en) * | 2024-02-05 | 2024-03-15 | 慧新全智工业互联科技(青岛)有限公司 | Cooperative multi-equipment abnormality automatic detection method and detection system |
CN117709806B (en) * | 2024-02-05 | 2024-05-28 | 慧新全智工业互联科技(青岛)有限公司 | Cooperative multi-equipment abnormality automatic detection method and detection system |
Also Published As
Publication number | Publication date |
---|---|
CN111309880B (en) | 2023-11-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111309880B (en) | Multi-agent action strategy learning method, device, medium and computing equipment | |
Zhao et al. | Deep reinforcement learning for list-wise recommendations | |
Takanobu et al. | Multi-agent task-oriented dialog policy learning with role-aware reward decomposition | |
US10204097B2 (en) | Efficient dialogue policy learning | |
JP7617344B2 (en) | Automated tagging and moderation of chat stream messages | |
US11115353B1 (en) | Conversational bot interaction with utterance ranking | |
Chen et al. | AgentGraph: Toward universal dialogue management with structured deep reinforcement learning | |
JP7219228B2 (en) | Strategic exploration in strategic dialogue between parties | |
CN112149824B (en) | Method and device for updating recommendation model by game theory | |
Bagga et al. | ANEGMA: an automated negotiation model for e-markets | |
AU2019422026B2 (en) | Sampling schemes for strategy searching in strategic interaction between parties | |
US20230085225A1 (en) | Systems and methods for generating and curating tasks | |
Crook et al. | Real user evaluation of a POMDP spoken dialogue system using automatic belief compression | |
CN112470123A (en) | Determining action selection guidelines for an execution device | |
Hou et al. | A corpus-free state2seq user simulator for task-oriented dialogue | |
Li et al. | Temporal supervised learning for inferring a dialog policy from example conversations | |
US20230064816A1 (en) | Automated cognitive load-based task throttling | |
US20230048441A1 (en) | Representative task generation and curation | |
Qiu et al. | Reward estimation with scheduled knowledge distillation for dialogue policy learning | |
US20230229957A1 (en) | Subcomponent model training | |
CN116611499A (en) | Method and apparatus for training reinforcement learning system for automatic bidding | |
Yin et al. | Context-uncertainty-aware chatbot action selection via parameterized auxiliary reinforcement learning | |
Madi et al. | Plmwsp: Probabilistic latent model for web service qos prediction | |
Le et al. | Generating predictable and adaptive dialog policies in single-and multi-domain goal-oriented dialog systems | |
US20230368078A1 (en) | Techniques for machine learning model selection for domain generalization |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |