CN115648204A - Training method, device, equipment and storage medium of intelligent decision model - Google Patents

Training method, device, equipment and storage medium of intelligent decision model

Info

Publication number
CN115648204A
Authority
CN
China
Prior art keywords
reward
action
external information
model
robot
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211172621.9A
Other languages
Chinese (zh)
Other versions
CN115648204B
Inventor
胡纪锋
孙妍超
陈贺昌
黄思理
朴海音
常毅
孙力超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jilin University
Original Assignee
Jilin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jilin University
Priority to CN202211172621.9A
Publication of CN115648204A
Application granted
Publication of CN115648204B
Legal status: Active


Landscapes

  • Feedback Control In General (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The application discloses a training method, apparatus, device, and storage medium for an intelligent decision-making model, belonging to the field of computer technology. With the technical solution provided by the embodiments of this application, external information collected by a robot in a target environment is obtained and input into the intelligent decision-making model, and the distributed executor model of the intelligent decision-making model outputs multiple action branches. Each action branch is an action that the robot may perform in the target environment given that external information. Based on the external information and the multiple action branches, the reward value distribution of each action branch is determined; that is, every action branch is evaluated. Based on the reward value distributions of the multiple action branches, reward aggregation is performed to determine a mixed reward and an integrated reward. Training the intelligent decision-making model based on the mixed reward, the integrated reward, and the external information achieves a relatively stable training effect.

Description

Training method, device, equipment and storage medium of intelligent decision-making model

Technical Field

The present application relates to the field of computer technology, and in particular to a training method, apparatus, device, and storage medium for an intelligent decision-making model.

Background

With the development of computer technology, Multi-Agent Reinforcement Learning (MARL) has made great progress in fields such as large-scale real-time battle games, robot control, autonomous driving, and stock trading.

In a multi-agent environment, factors such as the potential mutual influence between agents and the uncertainty of the environment itself make the reward values obtained by an agent uncertain, and this uncertainty in the reward values affects multi-agent learning. For example, in an environment containing multiple robots, both the interactions between robots and the uncertainty of the environment itself affect the training effect of each robot's intelligent decision-making model, which is a reinforcement learning model. Achieving stable learning of a multi-agent reinforcement learning model therefore becomes very difficult, so a method that can achieve stable learning of a multi-agent reinforcement learning model is urgently needed.

Summary of the Invention

Embodiments of the present application provide a training method, apparatus, device, and storage medium for an intelligent decision-making model, which can improve the training effect of a robot's decision-making model. The technical solution is as follows:

In one aspect, a training method for an intelligent decision-making model is provided, the method including:

obtaining external information collected by a robot in a target environment, the external information including external environment information and interaction information, the external environment information being information obtained by the robot observing the target environment, and the interaction information being information obtained by the robot interacting with other robots in the target environment;

inputting the external information into an intelligent decision-making model of the robot, performing prediction based on the external information by a distributed executor model of the intelligent decision-making model, and outputting multiple action branches of the robot, each action branch being an action that the robot may perform in the target environment;

determining a reward value distribution of each of the action branches based on the external information and the multiple action branches;

performing reward aggregation on sampled reward values sampled from the reward value distribution of each of the action branches to obtain a mixed reward and an integrated reward; and

training the intelligent decision-making model of the robot based on the mixed reward, the integrated reward, and the external information.

In a possible implementation, performing prediction based on the external information by the distributed executor model of the intelligent decision-making model and outputting the multiple action branches of the robot includes:

performing, by the distributed executor model of the intelligent decision-making model, at least one of full connection, convolution, and attention encoding on the external information to obtain external information features of the external information; and

performing, by the distributed executor model of the intelligent decision-making model, full connection and normalization on the external information features, and outputting the multiple action branches of the robot.

In a possible implementation, determining the reward value distribution of each of the action branches based on the external information and the multiple action branches includes:

inputting the external information and the multiple action branches into a reward value estimation model, performing reward value distribution estimation based on the external information and the multiple action branches through the reward value estimation model, and outputting the reward value distribution of each of the action branches.

In a possible implementation, performing reward aggregation on the sampled reward values sampled from the reward value distribution of each of the action branches to obtain the mixed reward and the integrated reward includes:

sampling the reward value distribution of each of the action branches to obtain a sampled reward value of each of the action branches;

performing policy-weighted fusion on the sampled reward values of the action branches to obtain the mixed reward; and

performing any one of global averaging, local averaging, and direct selection on the sampled reward values of the action branches to obtain the integrated reward.

In a possible implementation, training the distributed executor model of the robot's intelligent decision-making model based on the mixed reward, the integrated reward, and the external information includes:

training a reward value estimation model of the intelligent decision-making model based on the external information;

obtaining action advantage values of the multiple action branches based on the mixed reward, the external information, and a critic model of the intelligent decision-making model, the critic model being used to evaluate the multiple action branches based on the external information; and

training the distributed executor model of the intelligent decision-making model based on the action advantage values and the integrated reward.

In a possible implementation, obtaining the action advantage values of the multiple action branches based on the mixed reward, the external information, and the critic model of the intelligent decision-making model includes:

training the critic model of the intelligent decision-making model based on the mixed reward and the external information; and inputting the external information and the multiple action branches into the critic model, the critic model outputting the action advantage values of the multiple action branches.

In a possible implementation, training the intelligent decision-making model based on the action advantage values and the integrated reward includes:

constructing a loss function of the distributed executor model based on the action advantage values and the integrated reward; and

training the distributed executor model with gradient descent based on the loss function.

In one aspect, a training apparatus for an intelligent decision-making model is provided, the apparatus including:

an external information acquisition module, configured to obtain external information collected by a robot in a target environment, the external information including external environment information and interaction information, the external environment information being information obtained by the robot observing the target environment, and the interaction information being information obtained by the robot interacting with other robots in the target environment;

an action prediction module, configured to input the external information into an intelligent decision-making model of the robot, perform prediction based on the external information by a distributed executor model of the intelligent decision-making model, and output multiple action branches of the robot, each action branch being an action that the robot may perform in the target environment;

a reward value prediction module, configured to determine a reward value distribution of each of the action branches based on the external information and the multiple action branches;

a reward value aggregation module, configured to perform reward aggregation on sampled reward values sampled from the reward value distribution of each of the action branches to obtain a mixed reward and an integrated reward; and

a training module, configured to train the intelligent decision-making model of the robot based on the mixed reward, the integrated reward, and the external information.

In a possible implementation, the action prediction module is configured to: perform, by the distributed executor model of the intelligent decision-making model, at least one of full connection, convolution, and attention encoding on the external information to obtain external information features of the external information; and perform, by the distributed executor model of the intelligent decision-making model, full connection and normalization on the external information features and output the multiple action branches of the robot.

In a possible implementation, the reward value prediction module is configured to input the external information and the multiple action branches into a reward value estimation model, perform reward value distribution estimation based on the external information and the multiple action branches through the reward value estimation model, and output the reward value distribution of each of the action branches.

In a possible implementation, the reward value aggregation module is configured to: sample the reward value distribution of each of the action branches to obtain a sampled reward value of each of the action branches; perform policy-weighted fusion on the sampled reward values of the action branches to obtain the mixed reward; and perform any one of global averaging, local averaging, and direct selection on the sampled reward values of the action branches to obtain the integrated reward.

In a possible implementation, the training module is configured to: obtain action advantage values of the multiple action branches based on the mixed reward, the external information, and a critic model of the intelligent decision-making model, the critic model being used to evaluate the multiple action branches based on the external information; and train the distributed executor model of the intelligent decision-making model based on the action advantage values and the integrated reward.

In a possible implementation, the training module is configured to train a reward value estimation model of the intelligent decision-making model based on the external information.

In a possible implementation, the training module is configured to: train the critic model based on the mixed reward and the external information; and input the external information and the multiple action branches into the critic model, the critic model outputting the action advantage values of the multiple action branches.

In a possible implementation, the training module is configured to: construct a loss function of the distributed executor model based on the action advantage values and the integrated reward; and train the distributed executor model with gradient descent based on the loss function.

In one aspect, a computer device is provided, the computer device including one or more processors and one or more memories, the one or more memories storing at least one computer program, and the computer program being loaded and executed by the one or more processors to implement the above training method for an intelligent decision-making model.

In one aspect, a computer-readable storage medium is provided, the computer-readable storage medium storing at least one computer program, and the computer program being loaded and executed by a processor to implement the above training method for an intelligent decision-making model.

In one aspect, a computer program product or computer program is provided, the computer program product or computer program including program code stored in a computer-readable storage medium; a processor of a computer device reads the program code from the computer-readable storage medium and executes it, so that the computer device performs the above training method for an intelligent decision-making model.

With the technical solution provided by the embodiments of this application, external information collected by a robot in a target environment is obtained and input into the intelligent decision-making model, and the distributed executor model of the intelligent decision-making model outputs multiple action branches, each of which is an action that the robot may perform in the target environment given that external information. Based on the external information and the multiple action branches, the reward value distribution of each action branch is determined; that is, every action branch is evaluated. Based on the reward value distributions of the multiple action branches, reward aggregation is performed to determine a mixed reward and an integrated reward. Training the intelligent decision-making model based on the mixed reward, the integrated reward, and the external information achieves a relatively stable training effect.

Brief Description of the Drawings

In order to describe the technical solutions in the embodiments of the present application more clearly, the drawings required in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application, and those of ordinary skill in the art can obtain other drawings from these drawings without creative effort.

FIG. 1 is a schematic diagram of an implementation environment of a training method for an intelligent decision-making model provided by an embodiment of the present application;

FIG. 2 is a flowchart of a training method for an intelligent decision-making model provided by an embodiment of the present application;

FIG. 3 is a flowchart of a training method for an intelligent decision-making model provided by an embodiment of the present application;

FIG. 4 is a framework diagram of a training method for an intelligent decision-making model provided by an embodiment of the present application;

FIG. 5 is a schematic diagram of experimental results provided by an embodiment of the present application;

FIG. 6 is a schematic structural diagram of a training apparatus for an intelligent decision-making model provided by an embodiment of the present application;

FIG. 7 is a schematic structural diagram of a terminal provided by an embodiment of the present application;

FIG. 8 is a schematic structural diagram of a server provided by an embodiment of the present application.

Detailed Description

To make the purpose, technical solutions, and advantages of the present application clearer, the implementations of the present application are described in further detail below with reference to the accompanying drawings.

In the present application, the terms "first", "second", and the like are used to distinguish identical or similar items whose roles and functions are basically the same. It should be understood that "first", "second", and "n-th" have no logical or temporal dependency on one another and do not limit the quantity or execution order.

To explain the technical solutions provided by the embodiments of the present application, the terms involved in the embodiments of the present application are first introduced below.

Artificial Intelligence (AI) is a theory, method, technology, and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results.

Machine Learning (ML) is a multi-disciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other subjects. It studies how computers simulate or implement human learning behaviors in order to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve their performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and its applications span all fields of artificial intelligence. Machine learning and deep learning usually include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstration.

Deep Reinforcement Learning (DRL): deep reinforcement learning combines the perception ability of deep learning with the decision-making ability of reinforcement learning, can perform control directly from input images, and is an artificial intelligence method closer to the way humans think. Deep reinforcement learning is a branch of deep learning and can be formulated as a Markov Decision Process (MDP) represented by a five-tuple <S, A, R, P, γ>, denoting the environment state, action, reward, state transition matrix, and cumulative discount factor, respectively. The agent obtains states and rewards from the environment and produces actions that act on the environment; the environment receives the agent's action and, based on the current state, produces the next state for the agent. The goal of the agent is to maximize its long-term return.
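As an informal illustration of this agent-environment loop, the following Python sketch shows how states, actions, and rewards flow in an MDP and how the discounted long-term return is accumulated; the env API and the choose_action policy here are hypothetical placeholders, not part of this application:

```python
import random

def choose_action(state, actions):
    # Stand-in for the agent's policy: pick a random action.
    return random.choice(actions)

def run_episode(env, actions, gamma=0.99, max_steps=100):
    state = env.reset()                          # environment produces the initial state
    total_return, discount = 0.0, 1.0
    for _ in range(max_steps):
        action = choose_action(state, actions)   # agent acts on the environment
        state, reward, done = env.step(action)   # environment returns next state and reward
        total_return += discount * reward        # accumulate the discounted return
        discount *= gamma                        # apply the cumulative discount factor
        if done:
            break
    return total_return                          # the agent aims to maximize this quantity
```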

Normalization: mapping sequences of values with different ranges onto the (0, 1) interval to facilitate data processing. In some cases, the normalized values can be used directly as probabilities.
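For illustration, a minimal sketch of two common ways to perform such normalization is given below (min-max scaling and softmax, the latter producing values that sum to 1 and can be read directly as probabilities); the function names are assumptions for this example only:

```python
import numpy as np

def min_max_normalize(x):
    # Map values with different ranges onto the (0, 1) interval.
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min() + 1e-8)

def softmax(x):
    # Normalized values that sum to 1 and can be used directly as probabilities.
    x = np.asarray(x, dtype=float)
    e = np.exp(x - x.max())
    return e / e.sum()
```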

Learning Rate: used to control the learning progress of a model; the learning rate guides how the model uses the gradient of the loss function to adjust the network weights in gradient descent. If the learning rate is too large, the loss function may directly overshoot the global optimum, which manifests as an excessively large loss; if the learning rate is too small, the loss function changes very slowly, which greatly increases the convergence cost of the network and makes it easy to get trapped in local minima or saddle points.

Embedded Coding: embedded coding mathematically represents a correspondence, i.e., data in a space X is mapped to a space Y through a function F, where F is an injective function and the mapping is structure-preserving. Injective means that each piece of mapped data corresponds uniquely to a piece of data before mapping; structure-preserving means that the ordering of the data before mapping is preserved after mapping. For example, if data X1 and X2 exist before mapping, the mapping yields Y1 corresponding to X1 and Y2 corresponding to X2; if X1 > X2 before mapping, then correspondingly Y1 > Y2 after mapping. For words, this means mapping words into another space to facilitate subsequent machine learning and processing.

Attention Weight: indicates the importance of a piece of data during training or prediction, where importance reflects how much an input affects the output. Data with high importance has a larger attention weight, and data with low importance has a smaller attention weight. The importance of data differs across scenarios, and the process of training a model's attention weights is the process of determining the importance of data.

Having introduced the terms involved in the embodiments of the present application, the implementation environment of the embodiments of the present application is introduced below.

FIG. 1 is a schematic diagram of an implementation environment of a training method for an intelligent decision-making model provided by an embodiment of the present application. Referring to FIG. 1, the implementation environment may include a terminal 110 and a server 140.

The terminal 110 is connected to the server 140 through a wireless or wired network. Optionally, the terminal 110 is a smartphone, a tablet computer, a notebook computer, a desktop computer, or the like, but is not limited thereto. The terminal 110 installs and runs an application that supports training of the intelligent decision-making model.

The server 140 is an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a Content Delivery Network (CDN), and big data and artificial intelligence platforms. The server 140 provides background services for the application running on the terminal.

Those skilled in the art will appreciate that the number of terminals may be larger or smaller. For example, there may be only one terminal, or dozens or hundreds of terminals, or more, in which case the implementation environment also includes other terminals. The embodiments of the present application do not limit the number or device type of the terminals.

Having introduced the implementation environment of the embodiments of the present application, the application scenarios of the embodiments of the present application are introduced below.

The technical solution provided by the embodiments of the present application can be applied to multi-robot navigation scenarios. The intelligent decision-making model of a robot is trained with the technical solution provided by the embodiments of the present application, and the intelligent decision-making model can output actions according to the external information observed by the robot, these actions realizing navigation of the robot. Configuring this intelligent decision-making model for each of multiple robots in the same environment achieves the goal of multi-robot navigation. In the multi-robot navigation scenario described above, each robot can be regarded as an agent.

Alternatively, the technical solution provided by the embodiments of the present application can also be applied to other scenarios containing multiple agents; for example, it can be applied to multi-vehicle navigation scenarios, or to scenarios where multiple virtual objects battle in a game scene, which is not limited by the embodiments of the present application. In the following description, the application of the technical solution provided by the embodiments of the present application to a multi-robot navigation scenario is taken as an example.

Having introduced the implementation environment and application scenarios of the embodiments of the present application, the technical solution provided by the embodiments of the present application is introduced below. It should be noted that in the following description of the technical solution provided by the present application, the server is taken as the execution subject as an example. In other possible implementations, a terminal may also serve as the execution subject to execute the technical solution provided by the present application; the embodiments of the present application do not limit the type of the execution subject.

FIG. 2 is a flowchart of a training method for an intelligent decision-making model provided by an embodiment of the present application. Referring to FIG. 2, taking the server as the execution subject as an example, the method includes the following steps.

201. The server obtains external information collected by a robot in a target environment, the external information including external environment information and interaction information, the external environment information being information obtained by the robot observing the target environment, and the interaction information being information obtained by the robot interacting with other robots in the target environment.

The target environment includes multiple robots, and the robot in step 201 is any one of the multiple robots. The external information includes external environment information and interaction information. The external environment information is information obtained by the robot observing the target environment, that is, information obtained by the robot observing the target environment through multiple sensors; for example, it includes an environment image acquired by the robot's image sensor, a position acquired by the robot's position sensor, and attitude information acquired by the robot's gyroscope. Of course, different robots are equipped with different sensors, so the external environment information may vary with the sensors. The interaction information is information obtained by the robot interacting with other robots in the target environment; for example, it includes collision information between the robot and other robots.

202. The server inputs the external information into the intelligent decision-making model of the robot, the distributed executor model of the intelligent decision-making model performs prediction based on the external information, and multiple action branches of the robot are output, each action branch being an action that the robot may perform in the target environment.

The distributed executor model of the intelligent decision-making model is used to predict actions based on external information, where the external information is the robot's observation of the target environment; the function of the distributed executor model is to decide, based on the robot's observation, which action to perform. The multiple action branches are the actions that the robot may perform in the target environment given the external information.

203. The server determines the reward value distribution of each action branch based on the external information and the multiple action branches.

The reward value distribution of each action branch is the distribution of the potential reward obtainable by executing that action branch given the external information. The reward distribution is related to the actions taken by the agent, the interactions between agents, and the external information, where the agent is the robot.

204. The server performs reward aggregation on the sampled reward values sampled from the reward value distribution of each action branch to obtain a mixed reward and an integrated reward.

The mixed reward (Mixed reward Aggregation) is used to train the critic model; the integrated reward (Lumped reward Aggregation) is used to train the distributed executor model.

205. The server trains the intelligent decision-making model of the robot based on the mixed reward, the integrated reward, and the external information.

With the technical solution provided by the embodiments of this application, external information collected by a robot in a target environment is obtained and input into the intelligent decision-making model, and the distributed executor model of the intelligent decision-making model outputs multiple action branches, each of which is an action that the robot may perform in the target environment given that external information. Based on the external information and the multiple action branches, the reward value distribution of each action branch is determined; that is, every action branch is evaluated. Based on the reward value distributions of the multiple action branches, reward aggregation is performed to determine a mixed reward and an integrated reward. Training the intelligent decision-making model based on the mixed reward, the integrated reward, and the external information achieves a relatively stable training effect.

Steps 201-205 above are a brief introduction to the technical solution provided by the embodiments of the present application. The technical solution is explained more clearly below with some examples. Referring to FIG. 3 and FIG. 4, taking the server as the execution subject as an example, the method includes the following steps.

301. The server obtains external information collected by a robot in a target environment, the external information including external environment information and interaction information, the external environment information being information obtained by the robot observing the target environment, and the interaction information being information obtained by the robot interacting with other robots in the target environment.

The target environment includes multiple robots, and the robot in step 301 is any one of the multiple robots. The external information includes external environment information and interaction information. The external environment information is information obtained by the robot observing the target environment, that is, information obtained by the robot observing the target environment through multiple sensors; for example, it includes an environment image acquired by the robot's image sensor, a position acquired by the robot's position sensor, and attitude information acquired by the robot's gyroscope. Of course, different robots are equipped with different sensors, so the external environment information may vary with the sensors. The interaction information is information obtained by the robot interacting with other robots in the target environment; for example, it includes collision information between the robot and other robots. In some embodiments, the target environment is a simulated virtual environment.

In some embodiments, the multiple robots in the target environment interact with the target environment in a distributed manner. During the interaction, each robot can collect external information at different time steps, and the external information collected by the multiple robots at different time steps is stored in an experience replay buffer (Replay Buffer). The server can obtain, from the replay buffer, the external information collected by a robot at different time steps. The replay buffer is equivalent to a database, and the external information stored in it can serve as training samples when training the robot's intelligent decision-making model; the form of the replay buffer is shown in FIG. 4. Since the robot is any one of the multiple robots, the server can obtain, based on the robot's identifier, the external information collected by that robot in the target environment from the replay buffer. In some embodiments, the replay buffer also stores the actions performed by the robots in the target environment and the reward values given by the target environment for those actions.
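A minimal sketch of such an experience replay buffer is shown below; the stored fields and the sampling strategy are illustrative assumptions, not the application's prescribed implementation:

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (robot_id, observation, action, reward, next_observation) transitions."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest experience is discarded when full

    def add(self, robot_id, obs, action, reward, next_obs):
        self.buffer.append((robot_id, obs, action, reward, next_obs))

    def sample(self, batch_size, robot_id=None):
        # Optionally filter by robot identifier, as described above.
        pool = [t for t in self.buffer if robot_id is None or t[0] == robot_id]
        return random.sample(pool, min(batch_size, len(pool)))
```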

Under the deep reinforcement learning framework, the external information is the state of the environment, and the robot is the agent. The reward uncertainty caused by interactions between agents grows exponentially with the number of agents, and a state-action pair may correspond to multiple rewards; therefore, in the subsequent processing, reward-distribution modeling is performed over different action branches. A straightforward approach would be to model the reward distribution directly over the joint state-action space, but as the number of agents increases, this approach suffers from huge uncertainty due to inter-agent interactions. To address this challenge, the embodiments of the present application propose another way to achieve this goal: from the perspective of a given agent, the other agents are treated as part of the environment, which simplifies the problem, so that only the action-branch reward-distribution estimation of each individual agent needs to be considered rather than that of all agents jointly. In the above steps, the interaction information is accordingly regarded as part of the environment, and the external information used for action prediction includes both the external environment information and the interaction information, which simplifies the influence of the other robots on the robot. The idea of multi-action-branch reward distribution estimation is intuitively inspired by the observation that the human brain can always infer, from known facts, the possible consequences of the related decisions.

302. The server inputs the external information into the intelligent decision-making model of the robot, the distributed executor model of the intelligent decision-making model performs prediction based on the external information, and multiple action branches of the robot are output, each action branch being an action that the robot may perform in the target environment.

The distributed executor model is used to predict actions based on external information, where the external information is the robot's observation of the target environment; its function is to decide, based on the robot's observation, which action to perform. The multiple action branches are the actions that the robot may perform in the target environment given the external information. In some embodiments, the distributed executor model is also called a decentralized actor network (Decentralized Actor).

In a possible implementation, the server inputs the external information into the robot's distributed executor model, and the distributed executor model performs prediction based on the external information and a decision policy, and outputs the multiple action branches of the robot.

The decision policy (Policy) corresponds to the model parameters of the distributed executor model; training the distributed executor model means optimizing the decision policy (adjusting the model parameters), so that the distributed executor model can make more accurate action predictions from the external information.

For example, the server inputs the external information into the robot's distributed executor model, and the distributed executor model performs feature extraction on the external information to obtain external information features; for instance, the distributed executor model performs at least one of full connection, convolution, and attention encoding on the external information to obtain the external information features. The distributed executor model then performs full connection and normalization on the external information features and outputs the multiple action branches. After the full connection and normalization of the external information features, the distributed executor model obtains a probability set containing multiple probabilities, each probability corresponding to one action. The server determines the actions corresponding to the top target number of probabilities in the probability set as the action branches. For example, if the number of actions the robot can perform in the target environment is N, the probability set output by the distributed executor model contains the probabilities of these N actions, and the server determines M selectable actions among the N actions (M may also equal N) as the action branches. Referring to FIG. 4, the robot's distributed executor model is the Actor in FIG. 4.
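The following PyTorch-style sketch illustrates one way such a distributed executor (actor) could be organized, assuming a flat observation vector and a fully connected encoder; the class name, layer sizes, and the use of PyTorch are illustrative assumptions rather than the architecture prescribed by this application:

```python
import torch
import torch.nn as nn

class DecentralizedActor(nn.Module):
    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(              # feature extraction (full connection)
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.head = nn.Linear(hidden, n_actions)   # full connection to action logits

    def forward(self, obs):
        feats = self.encoder(obs)
        return torch.softmax(self.head(feats), dim=-1)   # normalization -> probability set

# Selecting the M highest-probability actions as the action branches:
# probs = actor(obs)
# branch_probs, branch_ids = probs.topk(M, dim=-1)
```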

303. The server determines the reward value distribution of each action branch based on the external information and the multiple action branches.

The reward value distribution of each action branch is the distribution of the potential reward obtainable by executing that action branch given the external information. The reward distribution is related to the actions taken by the agent, the interactions between agents, and the external information.

In a possible implementation, the server inputs the external information and the multiple action branches into a reward value estimation model, the reward value estimation model performs reward value distribution estimation based on the external information and the multiple action branches, and the reward value distribution of each action branch is output.

Referring to FIG. 4, the reward value estimation model is also called a multi-action-branch reward estimator (Multi-action-branch Reward Estimation), which is used to output reward values based on the environment state (external information) and the actions (action branches).

For example, at each time step t, for each agent (robot) i, the server uses a multi-action-branch reward estimator R_{θ_r^i}(o_t^i, a_t^{i,k}) to model the distribution of the rewards that may exist when agent i, having observed o_t^i, selects the action a_t^{i,k} on the k-th action branch. Here, P(ℝ) denotes the reward distribution space, R^{i,k} denotes the reward distribution of agent i on the k-th action branch, θ_r^i denotes the parameters of agent i's reward distribution, and o_t^i denotes the external information observed by robot i. R^i(o_t^i) denotes the estimated distributions over all action branches of agent i under observation o_t^i. Further, the multi-action-branch distribution estimation can be realized by optimizing an objective function, which captures the uncertainty of the reward function and provides a precondition for subsequently reducing the influence of reward uncertainty. The objective function takes the form of formulas (1) and (2):

$$J_r = \sum_i J_r^i \tag{1}$$

$$J_r^i = \mathbb{E}_{(o_t, a_t, r_t) \sim D}\Big[ -\log P\big[\, r_t \mid R_{\theta_r}(o_t, a_t^{k}) \,\big] \Big] + \Omega\big(R_{\theta_r}\big) \tag{2}$$

where formula (1) is the optimization function over the multi-action-branch reward estimators of all agents, J_r is the value of the objective function, the superscript denoting the agent index is omitted in formula (2), D is the experience replay buffer, -log P[.] denotes the negative log-likelihood loss, and Ω(.) is a regularization term on the reward distribution. In some embodiments, R_{θ_r} may adopt a Gaussian distribution, although the optimization objective is not restricted to a Gaussian and other distribution forms are also possible. When a Gaussian distribution is adopted, the regularization term is built from the variance var(μ) of the Gaussian mean vector and is weighted by the hyperparameters α and β.
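As an illustration of this kind of objective, the following sketch shows a Gaussian multi-action-branch reward estimator and a negative log-likelihood loss with a simple regularizer on the predicted mean vector; the network shape, the concrete regularization term, and all names are assumptions made for this example, not the exact formulation of the application:

```python
import torch
import torch.nn as nn

class MultiActionBranchRewardEstimator(nn.Module):
    """Predicts a Gaussian reward distribution (mu_k, sigma_k) for every action branch."""

    def __init__(self, obs_dim, n_branches, hidden=128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, n_branches)
        self.log_sigma = nn.Linear(hidden, n_branches)

    def forward(self, obs):
        h = self.body(obs)
        return self.mu(h), self.log_sigma(h).exp()

def reward_estimation_loss(estimator, obs, branch_idx, reward, alpha=0.1):
    """Negative log-likelihood on the executed branch plus an illustrative regularizer."""
    mu, sigma = estimator(obs)                                   # (batch, n_branches)
    mu_k = mu.gather(1, branch_idx.unsqueeze(1)).squeeze(1)      # executed branch only
    sigma_k = sigma.gather(1, branch_idx.unsqueeze(1)).squeeze(1)
    nll = -torch.distributions.Normal(mu_k, sigma_k).log_prob(reward)
    reg = alpha * mu.var(dim=1)   # regularization on the predicted mean vector (assumed form)
    return (nll + reg).mean()
```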

Multi-action-branch reward estimation can predict the potential rewards of all action branches, much as a human considers every possible consequence and usually weighs the consequences of all possible actions before making a decision. Therefore, in order to better evaluate historical experience and use it to achieve stable training, policy-weighted reward aggregation is introduced below to weaken the influence of reward uncertainty during training.

304. The server performs reward aggregation on the sampled reward values sampled from the reward value distribution of each action branch to obtain a mixed reward and an integrated reward.

At each time step, agent i can only take one action a_k and obtain the reward r_k on that action branch, the action branch referred to here being the action a_k. After the multi-action-branch reward estimation described above, however, r_k can be augmented into an embedded reward vector: the reward value at the k-th position of the embedded reward vector is r_k, and the values at the other positions are reward values sampled from R^i(o_t^i). Reward aggregation is then performed on the embedded reward vector to obtain two kinds of reward values for updating the centralized critic network (Centralized Critic) and the decentralized actor network (the distributed executor model). These two kinds of reward values are the mixed reward (Mixed reward Aggregation), which is used to train the centralized critic V_{γ,ψ}, and the integrated reward (Lumped reward Aggregation), which is used to train the distributed executor. The centralized critic network is used to evaluate the actions output by the distributed executor network, and the centralized critic network is also called the critic model.
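A minimal sketch of how such an embedded reward vector could be assembled from the estimated distributions is given below; tensor shapes and names are illustrative assumptions:

```python
import torch

def build_embedded_reward_vector(mu, sigma, branch_idx, env_reward):
    """Sample a reward for every action branch, then put the environment reward r_k
    at the k-th position of the vector (illustrative sketch)."""
    sampled = torch.normal(mu, sigma)     # one sampled reward per action branch
    sampled[branch_idx] = env_reward      # r_k placed at the k-th position
    return sampled                        # embedded reward vector m^i
```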

First, the method by which the server obtains the sampled reward values is described.

In a possible implementation, the server samples the reward value distribution of each action branch to obtain the sampled reward value of each action branch.

The methods by which the server obtains the mixed reward and the integrated reward in step 304 above are introduced separately below.

First, the method by which the server obtains the mixed reward is introduced.

In a possible implementation, the server performs policy-weighted fusion on the sampled reward values of the action branches to obtain the mixed reward.

The reward values of the multiple action branches form the embedded reward vector m^i, where a replacement operation replaces the reward on one action branch with the reward obtained from the environment: the final result m^i is the vector obtained by placing r_k at the k-th position of the embedded vector and filling the other positions with the corresponding values sampled from R^i(o_t^i).

For example, the server applies policy-weighted aggregation to the embedded reward vector and obtains the policy-weighted mixed reward. The function g(·) denotes one of two operations: the averaging operation g_MO, which directly averages its inputs, and the selection operation g_SS, which directly selects the entry m^i corresponding to agent i as the output of g(·). The policy π_θ(·|o) supplies the weights for the policy weighting, where o denotes the external information. In summary, the server can perform policy-weighted fusion of the reward values of the multiple action branches through the following formula (3) to obtain the mixed reward.

$$\hat{r}^{\,i}_{mix} = g\big(\pi_\theta(a_1\mid o)\,m_1,\ \ldots,\ \pi_\theta(a_N\mid o)\,m_N\big) \qquad (3)$$

where $\hat{r}^{\,i}_{mix}$ denotes the mixed reward corresponding to agent i, m_1, …, m_N are the reward values of the multiple action branches of agent i, N is the number of action branches and is a positive integer, and π_θ(a_k | o) denotes the probability that the policy assigns to action branch a_k given the external information o.

It should be noted that either of the above operations, the averaging operation g_MO or the selection operation g_SS, can be used to determine the mixed reward, which is not limited in the embodiments of the present application.
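A minimal sketch of the two aggregation choices for the mixed reward is given below; the policy-weighting form and the reading of the selection operation g_SS are assumptions based on the description of formula (3) above, and the array layout is chosen only for illustration.

```python
import numpy as np

def mixed_reward(m, pi, agent_idx, mode="MO"):
    """Policy-weighted mixed reward (sketch of formula (3)).

    m  : (n_agents, n_actions) array of embedded reward vectors (layout assumed).
    pi : matching (n_agents, n_actions) array of policy probabilities pi(a | o).
    """
    weighted = pi * m                       # policy weighting of each branch reward
    if mode == "MO":                        # averaging operation g_MO
        return weighted.mean()
    if mode == "SS":                        # selection operation g_SS (one possible
        return weighted[agent_idx].sum()    # reading: keep only agent i's entries)
    raise ValueError(f"unknown mode {mode}")
```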

The method by which the server obtains the integrated reward is introduced below.

在一种可能的实施方式中,服务器对各个动作分支的采样奖励值进行全局平均、局部平均以及直接选取中的任一项,得到该集成奖励。In a possible implementation manner, the server performs any one of global average, local average and direct selection on the sampling reward values of each action branch to obtain the integrated reward.

For example, the server obtains the integrated reward through the following formula (4):

$$\hat{r}^{\,i}_{lump} = l\big(\mathbf{m}^{1},\ \ldots,\ \mathbf{m}^{n}\big) \qquad (4)$$

where $\hat{r}^{\,i}_{lump}$ denotes the integrated reward corresponding to agent i, and m^j denotes the embedded reward vector of agent j among the n agents. The function l(·) has three common forms: the averaging operation (global average) l_MO, which directly averages all the input embedded reward vectors and uses the result as the integrated reward; the simplified average (local average) l_SMO, which averages only the embedded reward vector m^i corresponding to agent i and uses the result as the integrated reward; and the selection operation (direct selection) l_SS, which directly takes the reward fed back by the environment as the output. It should be noted that any of the above manners can be used to obtain the integrated reward, which is not limited in the embodiments of the present application.
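The three forms of l(·) can be sketched as follows; the array layout and names are assumptions, and r_env stands for the reward actually fed back by the environment.

```python
import numpy as np

def lumped_reward(m_all, agent_idx, r_env, mode="MO"):
    """Integrated (lumped) reward for agent i (sketch of formula (4)).

    m_all : (n_agents, n_actions) array holding every agent's embedded reward vector.
    """
    if mode == "MO":                    # global average over all embedded reward vectors
        return m_all.mean()
    if mode == "SMO":                   # local average: only agent i's embedded vector
        return m_all[agent_idx].mean()
    if mode == "SS":                    # direct selection of the environment reward
        return r_env
    raise ValueError(f"unknown mode {mode}")
```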

In some embodiments, referring to FIG. 4, the server obtains the mixed reward and the integrated reward through the reward aggregator (Reward Aggregation) in a distributed reward estimation (Distributional Reward Estimation) network. The multi-action-branch reward estimator (Multi-action-branch Reward Estimation) in the above step 303 also belongs to this distributed reward estimation network.

305、服务器基于该混合奖励、该外部信息和该智能决策模型的评论家模型,获取该多个动作分支的动作优势值,该评论家模型用于基于该外部信息对该多个动作分支进行评价。305. The server obtains the action advantage values of the multiple action branches based on the mixed reward, the external information, and the critic model of the intelligent decision-making model, and the critic model is used to evaluate the multiple action branches based on the external information .

其中,该评论家模型也即是上述集中式评论家网络。Wherein, the critic model is also the above-mentioned centralized critic network.

在一种可能的实施方式中,服务器基于该混合奖励训练该评论家模型。服务器将该外部信息和该多个动作分支输入该评论家模型,由该评论家模型输出该多个动作分支的动作优势值。In a possible implementation manner, the server trains the critic model based on the mixed reward. The server inputs the external information and the plurality of action branches into the critic model, and the critic model outputs action advantage values of the plurality of action branches.

为了对上述实施方式进行更加清楚地说明,下面将分为两个部分对上述实施方式进行说明。In order to describe the foregoing implementation manner more clearly, the description of the foregoing implementation manner will be described below in two parts.

第一部分、服务器基于该混合奖励和该外部信息训练该评论家模型。In the first part, the server trains the critic model based on the mixed reward and the external information.

在一种可能的实施方式中,服务器基于该混合奖励、该多个动作分支以及该机器人在下一个时间步采集到的该目标环境的外部信息,对该评论家模型进行训练。In a possible implementation manner, the server trains the critic model based on the mixed reward, the multiple action branches, and the external information of the target environment collected by the robot in the next time step.

For example, the server inputs the mixed reward, the plurality of action branches, the external information, and the external information of the target environment collected by the robot at the next time step into the critic model, and the critic model encodes these inputs to obtain encoded features. Through the critic model, the server aggregates the encoded features, applies at least one fully connected layer, and outputs a reference evaluation value and a target evaluation value for the plurality of action branches; the reference evaluation value is used to evaluate the plurality of action branches together with the external information, and the target evaluation value is used to evaluate the plurality of action branches together with the external information of the target environment collected by the robot at the next time step. The server trains the critic model based on the difference information between the target evaluation value and the reference evaluation value.
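A much-simplified critic is sketched below: it only maps an observation to a state value V_ψ(o); the richer encoding described above (actions, mixed reward, and next-step observation) is omitted for brevity, and the layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class CentralizedCritic(nn.Module):
    """Minimal centralized state-value critic (layer sizes are assumptions)."""

    def __init__(self, obs_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs):
        return self.net(obs).squeeze(-1)   # V(o)
```

In practice one copy of such a network can serve as the current state-value network and a second, periodically synchronized copy as the target state-value network.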

For example, the server optimizes the critic model V_{γ,ψ} by minimizing the Bellman residual, that is, the critic model is trained through the following formula (5):

$$J_c(\psi) = \mathbb{E}\Big[\big(\hat{r}^{\,i}_{mix} + \gamma V_{\bar{\psi}}(o') - V_{\psi}(o)\big)^{2}\Big] \qquad (5)$$

where J_c(ψ) is the value of the optimization objective, ψ and ψ̄ respectively denote the parameters of the current state-value network and of the target state-value network, V_{γ,ψ} denotes the critic model with its parameters, γ denotes the discount factor, and o' denotes the external information of the target environment collected by the robot at the next time step.
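One Bellman-residual update matching formula (5) could look as follows; the discount value and the use of a separate target network are assumptions consistent with the description above.

```python
import torch
import torch.nn.functional as F

def critic_update(critic, target_critic, optimizer, obs, next_obs, r_mix, gamma=0.99):
    """Minimize the squared Bellman residual of formula (5) for one batch.

    r_mix is a batch of mixed rewards; gamma=0.99 is an assumed discount value.
    """
    with torch.no_grad():
        target = r_mix + gamma * target_critic(next_obs)   # target evaluation value
    value = critic(obs)                                     # reference evaluation value
    loss = F.mse_loss(value, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```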

第二部分、服务器将该外部信息和该多个动作分支输入该评论家模型,由该评论家模型输出该多个动作分支的动作优势值。In the second part, the server inputs the external information and the plurality of action branches into the critic model, and the critic model outputs action advantage values of the plurality of action branches.

Referring to formula (5) above, the quantity inside the squared term, $\hat{r} + \gamma V_{\bar{\psi}}(o') - V_{\psi}(o)$, is the action advantage value $\hat{A}^{\,i}$. However, the mixed reward $\hat{r}^{\,i}_{mix}$ is not used to compute the action advantage value; the integrated reward $\hat{r}^{\,i}_{lump}$ is used instead, that is, $\hat{A}^{\,i} = \hat{r}^{\,i}_{lump} + \gamma V_{\bar{\psi}}(o') - V_{\psi}(o)$.
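The corresponding advantage computation, with the integrated reward in place of the mixed reward, can be sketched as follows; the discount value is an assumed default.

```python
import torch

@torch.no_grad()
def action_advantage(critic, target_critic, obs, next_obs, r_lump, gamma=0.99):
    """Advantage fed to the actor: the integrated (lumped) reward replaces the
    mixed reward; gamma=0.99 is an assumed discount value."""
    return r_lump + gamma * target_critic(next_obs) - critic(obs)
```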

在一些实施例中,服务器基于该外部信息训练该智能决策模型的奖励值估计模型。In some embodiments, the server trains the reward value estimation model of the intelligent decision-making model based on the external information.

306、服务器基于该动作优势值和该集成奖励,对该智能决策模型进行训练。306. The server trains the intelligent decision-making model based on the action advantage value and the integrated reward.

在一种可能的实施方式中,服务器基于该动作优势值和该集成奖励,构成该分布式执行者模型的损失函数。服务器基于该损失函数,采用梯度下降法对该分布式执行者模型进行训练。In a possible implementation manner, the server forms a loss function of the distributed executor model based on the action advantage value and the integrated reward. Based on the loss function, the server uses the gradient descent method to train the distributed executor model.

在一些实施例中,该损失函数为和近端策略最优化(Proximal PolicyOptimization,PPO)类似的损失函数。In some embodiments, the loss function is a loss function similar to Proximal Policy Optimization (PPO).

For example, the server trains the distributed actor model through the following formula (6), in which the superscript of the agent index is omitted:

$$J_a(\theta) = \mathbb{E}_t\Big[\min\big(\rho_t\,\hat{A}_t,\ \mathrm{clip}(\rho_t,\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\big) + \beta\,\mathcal{H}\big(\pi_\theta\big)\Big] \qquad (6)$$

where ρ_t denotes the importance weight and ε is the clipping hyperparameter, for which a commonly used value is selected in practice. $\mathcal{H}(\pi_\theta)$ denotes the policy entropy of agent i, and its coefficient β controls how important the entropy is in the final loss J_a(θ). The action advantage value $\hat{A}_t$ is related to the integrated reward $\hat{r}^{\,i}_{lump}$.
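A PPO-style clipped loss of the kind described by formula (6) can be sketched as follows; clip_eps and ent_coef are assumed default values rather than the values chosen in the embodiments.

```python
import torch

def actor_loss(logp_new, logp_old, advantage, entropy, clip_eps=0.2, ent_coef=0.01):
    """Clipped surrogate loss with an entropy bonus (sketch of formula (6))."""
    ratio = torch.exp(logp_new - logp_old)                 # importance weight rho_t
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    surrogate = torch.min(ratio * advantage, clipped * advantage)
    # The objective is maximized, so the returned loss is its negative.
    return -(surrogate + ent_coef * entropy).mean()
```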

It should be noted that the above steps 301-306 are described by taking one iteration of training of the distributed executor model as an example. In other training iterations, the training method is the same as in the above steps 301-306 and is not repeated here.

In addition, the above steps 301-306 are described by taking the server as the execution subject that trains the intelligent decision-making model as an example. In other possible implementations, the above steps 301-306 may also be executed by the terminal, which is not limited in the embodiments of the present application.

After the training of the intelligent decision-making model is completed, the distributed executor model can be deployed on multiple robots, and placing the multiple robots in a specified environment enables navigation of the multiple robots. For example, the multiple robots can be given the task of moving from position A to position B in the specified environment, which includes various kinds of terrain and obstacles. The multiple robots observe the specified environment and acquire external information of the specified environment; inputting this external information into the trained intelligent decision-making model yields the corresponding actions, for example going straight, turning, or lifting a leg. By executing these actions, the multiple robots can complete the task of moving from position A to position B.
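Deployment can be sketched as a simple control loop; the environment interface used here (reset/step returning one observation per robot and a done flag) is an assumption for illustration only.

```python
import torch

@torch.no_grad()
def run_episode(env, actors, max_steps=200):
    """Run deployed distributed actors on multiple robots for one episode."""
    obs = env.reset()                                   # one observation per robot (assumed API)
    for _ in range(max_steps):
        actions = []
        for actor, o in zip(actors, obs):
            probs = actor(torch.as_tensor(o, dtype=torch.float32))  # action-branch probabilities
            actions.append(int(probs.argmax()))          # e.g. go straight, turn, lift a leg
        obs, done = env.step(actions)                    # assumed to return (observations, done)
        if done:
            break
```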

上述所有可选技术方案,可以采用任意结合形成本申请的可选实施例,在此不再一一赘述。All the above optional technical solutions may be combined in any way to form optional embodiments of the present application, which will not be repeated here.

During the experiments, three cooperative multi-agent environments were selected for testing: CN, REF, and TREA. The first two scenarios are scenarios in the particle system, and the third is a harder version of Cooperative Treasure Collection developed on top of the particle system. The compared methods include MADDPG, MAAC, QMIX, IQL, and MAPPO, where MAPPO is the latest algorithm that extends PPO to multi-agent scenarios using the centralized-training decentralized-execution framework. Fig. 5 shows the experimental results: the method provided by the embodiments of the present application achieves the best performance in all three scenarios, indicating that it can indeed capture the uncertainty of rewards in the environment, and the performance curves show that the method successfully achieves a stable training process during learning.

通过本申请实施例提供的技术方案,获取了机器人在目标环境中采集到的外部信息,将外部信息输入智能决策模型,由智能决策模型的分布式执行者模型输出多个动作分支,该多个动作分支均是在获取到该外部信息的情况下,该机器人在该目标环境中可能执行的动作。基于外部信息和该多个动作分支,确定各个动作分支的奖励值分布,也即是对多个动作分支均进行了评价。基于多个动作分支的奖励值分布,进行奖励聚合确定混合奖励和集成奖励。基于该混合奖励和该集成奖励以及外部信息对该智能决策模型进行训练,能够达到较为稳定的训练效果。Through the technical solution provided by the embodiment of this application, the external information collected by the robot in the target environment is obtained, the external information is input into the intelligent decision-making model, and the distributed executor model of the intelligent decision-making model outputs multiple action branches. The action branches are actions that the robot may perform in the target environment when the external information is obtained. Based on the external information and the multiple action branches, the distribution of the reward value of each action branch is determined, that is, the multiple action branches are all evaluated. Based on the distribution of reward values of multiple action branches, reward aggregation is performed to determine mixed rewards and integrated rewards. Training the intelligent decision-making model based on the mixed reward, the integrated reward and external information can achieve a relatively stable training effect.

The embodiments of the present application provide a multi-agent distributed reward estimation and policy-weighted aggregation framework (DRE-MARL in Fig. 5), which captures and reduces the learning-performance instability caused by uncertain reward signals and achieves effective and robust model training. Within this framework, the possible reward signals on all action branches in a given state are considered jointly, and a more stable reward signal is provided for the critic (Critic) network to update the model parameters. First, a distributional multi-action-branch reward estimation is proposed, constructing a separate reward distribution function for each action branch. Second, reward signals are sampled from these per-branch distributions to form a sampled reward vector; the reward on the action branch actually taken is then replaced with the reward obtained from the environment to form the embedded reward vector, and various reward aggregation operations are applied to the embedded reward vector to obtain mixed reward signals that serve as the reward signals of the critic network and the decentralized actor (Decentralized Actor) network. Experimental results show that the method provided by the embodiments of the present application exhibits excellent performance, which demonstrates its effectiveness.

图6是本申请实施例提供的一种智能决策模型的训练装置的结构示意图,参见图6,装置包括:外部信息获取模块601、动作预测模块602、奖励值预测模块603、奖励值聚合模块604以及训练模块605。Fig. 6 is a schematic structural diagram of a training device for an intelligent decision-making model provided by an embodiment of the present application. Referring to Fig. 6, the device includes: an external information acquisition module 601, an action prediction module 602, a reward value prediction module 603, and a reward value aggregation module 604 and a training module 605 .

外部信息获取模块601,用于获取机器人在目标环境中采集到的外部信息,该外部信息包括外部环境信息和交互信息,该外部环境信息为该机器人观察该目标环境得到的信息,该交互信息为该机器人与该目标环境中的其他机器人进行交互得到的信息。The external information acquisition module 601 is used to acquire the external information collected by the robot in the target environment, the external information includes external environment information and interaction information, the external environment information is the information obtained by the robot observing the target environment, and the interaction information is The robot interacts with other robots in the target environment.

动作预测模块602,用于将该外部信息输入该机器人的智能决策模型,由该智能决策模型的分布式执行者模型基于该外部信息进行预测,输出该机器人的多个动作分支,该动作分支为该机器人在该目标环境中可能执行的动作。The action prediction module 602 is used to input the external information into the intelligent decision-making model of the robot, and the distributed executor model of the intelligent decision-making model performs prediction based on the external information, and outputs multiple action branches of the robot, and the action branches are Actions that the robot may perform in the target environment.

奖励值预测模块603,用于基于该外部信息和该多个动作分支,确定各个该动作分支的奖励值分布。The reward value prediction module 603 is configured to determine the reward value distribution of each of the action branches based on the external information and the plurality of action branches.

奖励值聚合模块604,用于对在各个该动作分支的奖励值分布中采样得到的采样奖励值进行奖励聚合,得到混合奖励和集成奖励。The reward value aggregation module 604 is configured to perform reward aggregation on the sampled reward values sampled in the reward value distribution of each action branch to obtain mixed rewards and integrated rewards.

训练模块605,用于基于该混合奖励、该集成奖励以及该外部信息,对该机器人的智能决策模型进行训练。A training module 605, configured to train the robot's intelligent decision-making model based on the mixed reward, the integrated reward and the external information.

在一种可能的实施方式中,该动作预测模块602,用于由该智能决策模型的分布式执行者模型对该外部信息进行全连接、卷积以及注意力编码中的至少一项,得到该外部信息的外部信息特征。由该智能决策模型的分布式执行者模型对该外部信息特征进行全连接和归一化,输出该机器人的多个动作分支。In a possible implementation, the action prediction module 602 is configured to perform at least one of full connection, convolution, and attention encoding on the external information by the distributed actor model of the intelligent decision-making model to obtain the External information characteristics of external information. The distributed executor model of the intelligent decision-making model performs full connection and normalization on the external information features, and outputs multiple action branches of the robot.
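A minimal actor of this kind, using only fully connected layers followed by normalization into action-branch probabilities, could be sketched as follows; the convolution and attention variants mentioned above are omitted and the layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class DistributedActor(nn.Module):
    """Decentralized actor: encode external information, then output normalized
    probabilities over the action branches (fully connected sketch)."""

    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())  # external-information features
        self.head = nn.Linear(hidden, n_actions)

    def forward(self, obs):
        feat = self.encoder(obs)
        return torch.softmax(self.head(feat), dim=-1)    # one probability per action branch
```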

在一种可能的实施方式中,该奖励值预测模块603,用于将该外部信息和该多个动作分支输入奖励值估计模型,通过该奖励值估计模型基于该外部信息和该多个动作分支进行奖励值分布估计,输出各个该动作分支的奖励值分布。In a possible implementation manner, the reward value prediction module 603 is configured to input the external information and the multiple action branches into the reward value estimation model, and the reward value estimation model is based on the external information and the multiple action branches Estimate the reward value distribution, and output the reward value distribution of each action branch.
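A reward value estimation model that outputs one reward distribution per action branch could be sketched as below; the Gaussian parameterization, the choice to condition only on the observation, and the layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class RewardEstimator(nn.Module):
    """Multi-action-branch reward estimator: one (mean, std) pair per action branch."""

    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, n_actions)
        self.log_std = nn.Linear(hidden, n_actions)

    def forward(self, obs):
        h = self.body(obs)
        return self.mu(h), self.log_std(h).exp()   # per-branch mean and standard deviation
```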

在一种可能的实施方式中,该奖励值聚合模块604,用于对各个该动作分支的奖励值分布进行采样,得到各个该动作分支的采样奖励值。对该各个动作分支的采样奖励值进行策略加权融合,得到该混合奖励。对各个该动作分支的采样奖励值进行全局平均、局部平均以及直接选取中的任一项,得到该集成奖励。In a possible implementation manner, the reward value aggregation module 604 is configured to sample the reward value distribution of each action branch to obtain the sampled reward value of each action branch. The strategy weighted fusion is performed on the sampling reward values of each action branch to obtain the mixed reward. Perform any one of global average, local average, and direct selection on the sampling reward values of each action branch to obtain the integrated reward.

在一种可能的实施方式中,该训练模块605,用于基于该混合奖励、该外部信息和该智能决策模型的评论家模型,获取该多个动作分支的动作优势值,该评论家模型用于基于该外部信息对该多个动作分支进行评价。基于该外部信息训练该智能决策模型的奖励值估计模型。基于该动作优势值和该集成奖励,对该智能决策模型的分布式执行者模型进行训练。In a possible implementation manner, the training module 605 is configured to obtain action advantage values of the plurality of action branches based on the mixed reward, the external information and the critic model of the intelligent decision model, and the critic model uses The multiple action branches are evaluated based on the external information. A reward value estimation model of the intelligent decision-making model is trained based on the external information. Based on the action advantage value and the integrated reward, the distributed executor model of the intelligent decision-making model is trained.

在一种可能的实施方式中,该训练模块605,基于该混合奖励和该外部信息训练该智能决策模型的评论家模型。将该外部信息和该多个动作分支输入该评论家模型,由该评论家模型输出该多个动作分支的动作优势值。In a possible implementation manner, the training module 605 trains the critic model of the intelligent decision-making model based on the mixed reward and the external information. The external information and the plurality of action branches are input into the critic model, and the critic model outputs action advantage values of the plurality of action branches.

在一种可能的实施方式中,该训练模块605,用于基于该动作优势值和该集成奖励,构成该分布式执行者模型的损失函数。基于该损失函数,采用梯度下降法对该分布式执行者模型进行训练。In a possible implementation manner, the training module 605 is configured to form a loss function of the distributed actor model based on the action advantage value and the integrated reward. Based on the loss function, the distributed executor model is trained using the gradient descent method.

需要说明的是:上述实施例提供的智能决策模型的训练装置在训练该智能决策模型时,仅以上述各功能模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能模块完成,即将计算机设备的内部结构划分成不同的功能模块,以完成以上描述的全部或者部分功能。另外,上述实施例提供的智能决策模型的训练装置与智能决策模型的训练方法实施例属于同一构思,其具体实现过程详见方法实施例,这里不再赘述。It should be noted that: when the training device for the intelligent decision-making model provided by the above-mentioned embodiment trains the intelligent decision-making model, the division of the above-mentioned functional modules is used as an example for illustration. In practical applications, the above-mentioned functions can be assigned by different The functional modules are completed, that is, the internal structure of the computer device is divided into different functional modules to complete all or part of the functions described above. In addition, the intelligent decision-making model training device and the intelligent decision-making model training method embodiment provided by the above-mentioned embodiments belong to the same concept, and the specific implementation process thereof can be found in the method embodiment, and will not be repeated here.

通过本申请实施例提供的技术方案,获取了机器人在目标环境中采集到的外部信息,将外部信息输入智能决策模型,由智能决策模型的分布式执行者模型输出多个动作分支,该多个动作分支均是在获取到该外部信息的情况下,该机器人在该目标环境中可能执行的动作。基于外部信息和该多个动作分支,确定各个动作分支的奖励值分布,也即是对多个动作分支均进行了评价。基于多个动作分支的奖励值分布,进行奖励聚合确定混合奖励和集成奖励。基于该混合奖励和该集成奖励以及外部信息对该智能决策模型进行训练,能够达到较为稳定的训练效果。Through the technical solution provided by the embodiment of this application, the external information collected by the robot in the target environment is obtained, the external information is input into the intelligent decision-making model, and the distributed executor model of the intelligent decision-making model outputs multiple action branches. The action branches are actions that the robot may perform in the target environment when the external information is acquired. Based on the external information and the multiple action branches, the distribution of the reward value of each action branch is determined, that is, the multiple action branches are all evaluated. Based on the distribution of reward values of multiple action branches, reward aggregation is performed to determine mixed rewards and integrated rewards. Training the intelligent decision-making model based on the mixed reward, the integrated reward and external information can achieve a relatively stable training effect.

本申请实施例提供了一种计算机设备,用于执行上述方法,该计算机设备可以实现为终端或者服务器,下面先对终端的结构进行介绍:An embodiment of the present application provides a computer device for performing the above method. The computer device can be implemented as a terminal or a server. The structure of the terminal is firstly introduced below:

图7是本申请实施例提供的一种终端的结构示意图。该终端700可以是:智能手机、平板电脑、笔记本电脑或台式电脑。终端700还可能被称为用户设备、便携式终端、膝上型终端、台式终端等其他名称。FIG. 7 is a schematic structural diagram of a terminal provided by an embodiment of the present application. The terminal 700 may be: a smart phone, a tablet computer, a notebook computer or a desktop computer. The terminal 700 may also be called user equipment, portable terminal, laptop terminal, desktop terminal and other names.

通常,终端700包括有:一个或多个处理器701和一个或多个存储器702。Generally, the terminal 700 includes: one or more processors 701 and one or more memories 702 .

处理器701可以包括一个或多个处理核心,比如4核心处理器、8核心处理器等。处理器701可以采用DSP(Digital Signal Processing,数字信号处理)、FPGA(Field-Programmable Gate Array,现场可编程门阵列)、PLA(Programmable Logic Array,可编程逻辑阵列)中的至少一种硬件形式来实现。处理器701也可以包括主处理器和协处理器,主处理器是用于对在唤醒状态下的数据进行处理的处理器,也称CPU(Central ProcessingUnit,中央处理器);协处理器是用于对在待机状态下的数据进行处理的低功耗处理器。在一些实施例中,处理器701可以在集成有GPU(Graphics Processing Unit,图像处理器),GPU用于负责显示屏所需要显示的内容的渲染和绘制。一些实施例中,处理器701还可以包括AI(Artificial Intelligence,人工智能)处理器,该AI处理器用于处理有关机器学习的计算操作。The processor 701 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and the like. The processor 701 can adopt at least one hardware form in DSP (Digital Signal Processing, digital signal processing), FPGA (Field-Programmable Gate Array, field programmable gate array), PLA (Programmable Logic Array, programmable logic array) accomplish. The processor 701 may also include a main processor and a coprocessor, the main processor is a processor for processing data in the wake-up state, and is also called a CPU (Central Processing Unit, central processing unit); the coprocessor is used to Low-power processor for processing data in standby state. In some embodiments, the processor 701 may be integrated with a GPU (Graphics Processing Unit, image processor), and the GPU is used for rendering and drawing content that needs to be displayed on the display screen. In some embodiments, the processor 701 may further include an AI (Artificial Intelligence, artificial intelligence) processor, where the AI processor is configured to process computing operations related to machine learning.

存储器702可以包括一个或多个计算机可读存储介质,该计算机可读存储介质可以是非暂态的。存储器702还可包括高速随机存取存储器,以及非易失性存储器,比如一个或多个磁盘存储设备、闪存存储设备。在一些实施例中,存储器702中的非暂态的计算机可读存储介质用于存储至少一个计算机程序,该至少一个计算机程序用于被处理器701所执行以实现本申请中方法实施例提供的智能决策模型的训练方法。Memory 702 may include one or more computer-readable storage media, which may be non-transitory. The memory 702 may also include high-speed random access memory, and non-volatile memory, such as one or more magnetic disk storage devices and flash memory storage devices. In some embodiments, the non-transitory computer-readable storage medium in the memory 702 is used to store at least one computer program, and the at least one computer program is used to be executed by the processor 701 to implement the methods provided by the method embodiments in this application. Training methods for intelligent decision-making models.

在一些实施例中,终端700还可选包括有:外围设备接口703和至少一个外围设备。处理器701、存储器702和外围设备接口703之间可以通过总线或信号线相连。各个外围设备可以通过总线、信号线或电路板与外围设备接口703相连。具体地,外围设备包括:射频电路704、显示屏705、摄像头组件706、音频电路707和电源708中的至少一种。In some embodiments, the terminal 700 may optionally further include: a peripheral device interface 703 and at least one peripheral device. The processor 701, the memory 702, and the peripheral device interface 703 may be connected through buses or signal lines. Each peripheral device can be connected to the peripheral device interface 703 through a bus, a signal line or a circuit board. Specifically, the peripheral device includes: at least one of a radio frequency circuit 704 , a display screen 705 , a camera assembly 706 , an audio circuit 707 and a power supply 708 .

外围设备接口703可被用于将I/O(Input/Output,输入/输出)相关的至少一个外围设备连接到处理器701和存储器702。在一些实施例中,处理器701、存储器702和外围设备接口703被集成在同一芯片或电路板上;在一些其他实施例中,处理器701、存储器702和外围设备接口703中的任意一个或两个可以在单独的芯片或电路板上实现,本实施例对此不加以限定。The peripheral device interface 703 may be used to connect at least one peripheral device related to I/O (Input/Output, input/output) to the processor 701 and the memory 702 . In some embodiments, the processor 701, memory 702 and peripheral device interface 703 are integrated on the same chip or circuit board; in some other embodiments, any one of the processor 701, memory 702 and peripheral device interface 703 or The two can be implemented on a separate chip or circuit board, which is not limited in this embodiment.

射频电路704用于接收和发射RF(Radio Frequency,射频)信号,也称电磁信号。射频电路704通过电磁信号与通信网络以及其他通信设备进行通信。射频电路704将电信号转换为电磁信号进行发送,或者,将接收到的电磁信号转换为电信号。可选地,射频电路704包括:天线系统、RF收发器、一个或多个放大器、调谐器、振荡器、数字信号处理器、编解码芯片组、用户身份模块卡等等。The radio frequency circuit 704 is configured to receive and transmit RF (Radio Frequency, radio frequency) signals, also called electromagnetic signals. The radio frequency circuit 704 communicates with the communication network and other communication devices through electromagnetic signals. The radio frequency circuit 704 converts electrical signals into electromagnetic signals for transmission, or converts received electromagnetic signals into electrical signals. Optionally, the radio frequency circuit 704 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and the like.

显示屏705用于显示UI(User Interface,用户界面)。该UI可以包括图形、文本、图标、视频及其它们的任意组合。当显示屏705是触摸显示屏时,显示屏705还具有采集在显示屏705的表面或表面上方的触摸信号的能力。该触摸信号可以作为控制信号输入至处理器701进行处理。此时,显示屏705还可以用于提供虚拟按钮和/或虚拟键盘,也称软按钮和/或软键盘。The display screen 705 is used to display a UI (User Interface, user interface). The UI can include graphics, text, icons, video, and any combination thereof. When the display screen 705 is a touch display screen, the display screen 705 also has the ability to collect touch signals on or above the surface of the display screen 705 . The touch signal can be input to the processor 701 as a control signal for processing. At this time, the display screen 705 can also be used to provide virtual buttons and/or virtual keyboards, also called soft buttons and/or soft keyboards.

摄像头组件706用于采集图像或视频。可选地,摄像头组件706包括前置摄像头和后置摄像头。通常,前置摄像头设置在终端的前面板,后置摄像头设置在终端的背面。The camera assembly 706 is used to capture images or videos. Optionally, the camera component 706 includes a front camera and a rear camera. Usually, the front camera is set on the front panel of the terminal, and the rear camera is set on the back of the terminal.

音频电路707可以包括麦克风和扬声器。麦克风用于采集用户及环境的声波,并将声波转换为电信号输入至处理器701进行处理,或者输入至射频电路704以实现语音通信。Audio circuitry 707 may include a microphone and speakers. The microphone is used to collect sound waves of the user and the environment, and convert the sound waves into electrical signals and input them to the processor 701 for processing, or input them to the radio frequency circuit 704 to realize voice communication.

电源708用于为终端700中的各个组件进行供电。电源708可以是交流电、直流电、一次性电池或可充电电池。The power supply 708 is used to supply power to various components in the terminal 700 . Power source 708 may be alternating current, direct current, disposable batteries, or rechargeable batteries.

在一些实施例中,终端700还包括有一个或多个传感器709。该一个或多个传感器709包括但不限于:加速度传感器710、陀螺仪传感器711、压力传感器712、光学传感器713以及接近传感器714。In some embodiments, the terminal 700 also includes one or more sensors 709 . The one or more sensors 709 include, but are not limited to: an acceleration sensor 710 , a gyro sensor 711 , a pressure sensor 712 , an optical sensor 713 and a proximity sensor 714 .

加速度传感器710可以检测以终端700建立的坐标系的三个坐标轴上的加速度大小。The acceleration sensor 710 can detect the acceleration on the three coordinate axes of the coordinate system established by the terminal 700 .

陀螺仪传感器711可以终端700的机体方向及转动角度,陀螺仪传感器711可以与加速度传感器710协同采集用户对终端700的3D动作。The gyro sensor 711 can measure the body direction and rotation angle of the terminal 700 , and the gyro sensor 711 can cooperate with the acceleration sensor 710 to collect 3D actions of the user on the terminal 700 .

压力传感器712可以设置在终端700的侧边框和/或显示屏705的下层。当压力传感器712设置在终端700的侧边框时,可以检测用户对终端700的握持信号,由处理器701根据压力传感器712采集的握持信号进行左右手识别或快捷操作。当压力传感器712设置在显示屏705的下层时,由处理器701根据用户对显示屏705的压力操作,实现对UI界面上的可操作性控件进行控制。The pressure sensor 712 may be disposed on a side frame of the terminal 700 and/or a lower layer of the display screen 705 . When the pressure sensor 712 is installed on the side frame of the terminal 700 , it can detect the user's grip signal on the terminal 700 , and the processor 701 performs left and right hand recognition or shortcut operation according to the grip signal collected by the pressure sensor 712 . When the pressure sensor 712 is disposed on the lower layer of the display screen 705, the processor 701 controls operable controls on the UI interface according to the user's pressure operation on the display screen 705.

光学传感器713用于采集环境光强度。在一个实施例中,处理器701可以根据光学传感器713采集的环境光强度,控制显示屏705的显示亮度。The optical sensor 713 is used to collect ambient light intensity. In one embodiment, the processor 701 may control the display brightness of the display screen 705 according to the ambient light intensity collected by the optical sensor 713 .

接近传感器714用于采集用户与终端700的正面之间的距离。The proximity sensor 714 is used to collect the distance between the user and the front of the terminal 700 .

本领域技术人员可以理解,图7中示出的结构并不构成对终端700的限定,可以包括比图示更多或更少的组件,或者组合某些组件,或者采用不同的组件布置。Those skilled in the art can understand that the structure shown in FIG. 7 does not constitute a limitation on the terminal 700, and may include more or less components than shown in the figure, or combine certain components, or adopt different component arrangements.

上述计算机设备还可以实现为服务器,下面对服务器的结构进行介绍:The above-mentioned computer equipment can also be realized as a server, and the structure of the server is introduced below:

FIG. 8 is a schematic structural diagram of a server provided by an embodiment of the present application. The server 800 may vary greatly due to different configurations or performance, and may include one or more processors (Central Processing Units, CPU) 801 and one or more memories 802, where at least one computer program is stored in the one or more memories 802, and the at least one computer program is loaded and executed by the one or more processors 801 to implement the methods provided by the above method embodiments. Of course, the server 800 may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface for input and output, and the server 800 may also include other components for implementing device functions, which are not described in detail here.

在示例性实施例中,还提供了一种计算机可读存储介质,例如包括计算机程序的存储器,上述计算机程序可由处理器执行以完成上述实施例中的智能决策模型的训练方法。例如,该计算机可读存储介质可以是只读存储器(Read-Only Memory,ROM)、随机存取存储器(Random Access Memory,RAM)、只读光盘(Compact Disc Read-Only Memory,CD-ROM)、磁带、软盘和光数据存储设备等。In an exemplary embodiment, there is also provided a computer-readable storage medium, such as a memory including a computer program, and the computer program can be executed by a processor to complete the intelligent decision-making model training method in the above-mentioned embodiment. For example, the computer-readable storage medium may be a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a read-only optical disc (Compact Disc Read-Only Memory, CD-ROM), Magnetic tapes, floppy disks, and optical data storage devices, etc.

在示例性实施例中,还提供了一种计算机程序产品或计算机程序,该计算机程序产品或计算机程序包括程序代码,该程序代码存储在计算机可读存储介质中,计算机设备的处理器从计算机可读存储介质读取该程序代码,处理器执行该程序代码,使得该计算机设备执行上述智能决策模型的训练方法。In an exemplary embodiment, a computer program product or computer program is also provided, the computer program product or computer program includes program code, the program code is stored in a computer-readable storage medium, and the processor of the computer device reads from the computer The program code is read by reading the storage medium, and the processor executes the program code, so that the computer device executes the above-mentioned training method for the intelligent decision-making model.

在一些实施例中,本申请实施例所涉及的计算机程序可被部署在一个计算机设备上执行,或者在位于一个地点的多个计算机设备上执行,又或者,在分布在多个地点且通过通信网络互连的多个计算机设备上执行,分布在多个地点且通过通信网络互连的多个计算机设备可以组成区块链系统。In some embodiments, the computer programs involved in the embodiments of the present application can be deployed and executed on one computer device, or executed on multiple computer devices at one location, or distributed in multiple locations and communicated Executed on multiple computer devices interconnected by the network, multiple computer devices distributed in multiple locations and interconnected through a communication network can form a blockchain system.

本领域普通技术人员可以理解实现上述实施例的全部或部分步骤可以通过硬件来完成,也可以通过程序来指令相关的硬件完成,该程序可以存储于一种计算机可读存储介质中,上述提到的存储介质可以是只读存储器,磁盘或光盘等。Those of ordinary skill in the art can understand that all or part of the steps for implementing the above-mentioned embodiments can be completed by hardware, and can also be completed by instructing related hardware through a program. The program can be stored in a computer-readable storage medium. The above-mentioned The storage medium can be read-only memory, magnetic disk or optical disk and so on.

上述仅为本申请的可选实施例,并不用以限制本申请,凡在本申请的精神和原则之内,所做的任何修改、等同替换、改进等,均应包含在本申请的保护范围之内。The above are only optional embodiments of the application, and are not intended to limit the application. Any modifications, equivalent replacements, improvements, etc. made within the spirit and principles of the application shall be included in the protection scope of the application. within.

Claims (10)

1. A method for training an intelligent decision model, the method comprising:
acquiring external information acquired by a robot in a target environment, wherein the external information comprises external environment information and interaction information, the external environment information is information obtained by observing the target environment by the robot, and the interaction information is information obtained by interacting the robot with other robots in the target environment;
inputting the external information into an intelligent decision model of the robot, performing prediction by a distributed executor model of the intelligent decision model based on the external information, and outputting a plurality of action branches of the robot, wherein the action branches are actions which can be executed by the robot in the target environment;
determining a reward value distribution of each action branch based on the external information and the plurality of action branches;
carrying out reward aggregation on the sampled reward values obtained by sampling in the reward value distribution of each action branch to obtain mixed reward and integrated reward;
training an intelligent decision model of the robot based on the hybrid reward, the integrated reward, and the external information.
2. The method of claim 1, wherein the predicting by the distributed actor model of the intelligent decision model based on the external information, outputting a plurality of action branches for the robot comprises:
performing at least one of full connection, convolution and attention coding on the external information by a distributed executor model of the intelligent decision model to obtain external information characteristics of the external information;
and fully connecting and normalizing the external information features by a distributed executor model of the intelligent decision model, and outputting a plurality of action branches of the robot.
3. The method of claim 1, wherein determining a reward value distribution for each of the action branches based on the external information and the plurality of action branches comprises:
inputting the external information and the action branches into a reward value estimation model, performing reward value distribution estimation through the reward value estimation model based on the external information and the action branches, and outputting reward value distribution of each action branch.
4. The method of claim 1, wherein performing reward aggregation on the sampled reward values sampled from the reward value distribution of each of the action branches to obtain the hybrid reward and the integrated reward comprises:
sampling the reward value distribution of each action branch to obtain the sampling reward value of each action branch;
carrying out strategy weighted fusion on the sampling reward values of all the action branches to obtain the mixed reward;
and carrying out any one of global averaging, local averaging and direct selection on the sampled reward values of the action branches to obtain the integrated reward.
5. The method of claim 1, wherein training a smart decision model of the robot based on the hybrid reward, the integrated reward, and the external information comprises:
training a reward value estimation model of the intelligent decision model based on the external information;
obtaining action advantage values of the action branches based on the hybrid reward, the external information and a critic model of the intelligent decision model, wherein the critic model is used for evaluating the action branches based on the external information;
training a distributed actor model of the intelligent decision model based on the action advantage value and the integrated reward.
6. The method of claim 5, wherein obtaining the action advantage values for the plurality of action branches based on the hybrid reward, the external information, and a critic model of the intelligent decision model comprises:
training a critic model of the intelligent decision model based on the hybrid reward and the external information;
inputting the external information and the plurality of action branches into the critic model, and outputting action dominance values of the plurality of action branches by the critic model.
7. The method of claim 5, wherein training the intelligent decision model based on the action advantage value and the integrated reward comprises:
constructing a loss function for the distributed actor model based on the action dominance value and the integrated reward;
and training the distributed executor model by adopting a gradient descent method based on the loss function.
8. An apparatus for training an intelligent decision model, the apparatus comprising:
the external information acquisition module is used for acquiring external information acquired by the robot in a target environment, wherein the external information comprises external environment information and interaction information, the external environment information is information obtained by observing the target environment by the robot, and the interaction information is information obtained by interacting the robot with other robots in the target environment;
the action prediction module is used for inputting the external information into an intelligent decision-making model of the robot, performing prediction by a distributed executor model of the intelligent decision-making model based on the external information, and outputting a plurality of action branches of the robot, wherein the action branches are actions which are possibly executed by the robot in the target environment;
the reward value prediction module is used for determining reward value distribution of each action branch based on the external information and the action branches;
the reward value aggregation module is used for carrying out reward aggregation on the sampled reward values sampled in the reward value distribution of each action branch to obtain mixed rewards and integrated rewards;
and the training module is used for training the intelligent decision model of the robot based on the mixed reward, the integrated reward and the external information.
9. A computer device, characterized in that the computer device comprises one or more processors and one or more memories, in which at least one computer program is stored, which is loaded and executed by the one or more processors to implement the method of training of an intelligent decision model according to any one of claims 1 to 7.
10. A computer-readable storage medium, in which at least one computer program is stored, which is loaded and executed by a processor to implement a method of training an intelligent decision model as claimed in any one of claims 1 to 7.
CN202211172621.9A 2022-09-26 2022-09-26 Training method, device, equipment and storage medium of intelligent decision model Active CN115648204B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211172621.9A CN115648204B (en) 2022-09-26 2022-09-26 Training method, device, equipment and storage medium of intelligent decision model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211172621.9A CN115648204B (en) 2022-09-26 2022-09-26 Training method, device, equipment and storage medium of intelligent decision model

Publications (2)

Publication Number Publication Date
CN115648204A true CN115648204A (en) 2023-01-31
CN115648204B CN115648204B (en) 2024-08-27

Family

ID=84985656

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211172621.9A Active CN115648204B (en) 2022-09-26 2022-09-26 Training method, device, equipment and storage medium of intelligent decision model

Country Status (1)

Country Link
CN (1) CN115648204B (en)

Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111026272A (en) * 2019-12-09 2020-04-17 网易(杭州)网络有限公司 Training method and device for virtual object behavior strategy, electronic equipment and storage medium
US20220036186A1 (en) * 2020-07-30 2022-02-03 Waymo Llc Accelerated deep reinforcement learning of agent control policies
CN112329948A (en) * 2020-11-04 2021-02-05 腾讯科技(深圳)有限公司 Multi-agent strategy prediction method and device
CN112476424A (en) * 2020-11-13 2021-03-12 腾讯科技(深圳)有限公司 Robot control method, device, equipment and computer storage medium
WO2022135066A1 (en) * 2020-12-25 2022-06-30 南京理工大学 Temporal difference-based hybrid flow-shop scheduling method
CN112861442A (en) * 2021-03-10 2021-05-28 中国人民解放军国防科技大学 Multi-machine collaborative air combat planning method and system based on deep reinforcement learning
CN112843725A (en) * 2021-03-15 2021-05-28 网易(杭州)网络有限公司 Intelligent agent processing method and device
CN113221444A (en) * 2021-04-20 2021-08-06 中国电子科技集团公司第五十二研究所 Behavior simulation training method for air intelligent game
CN113435606A (en) * 2021-07-01 2021-09-24 吉林大学 Method and device for optimizing reinforcement learning model, storage medium and electronic equipment
CN113392935A (en) * 2021-07-09 2021-09-14 浙江工业大学 Multi-agent deep reinforcement learning strategy optimization method based on attention mechanism
CN113609786A (en) * 2021-08-27 2021-11-05 中国人民解放军国防科技大学 A mobile robot navigation method, device, computer equipment and storage medium
CN114004149A (en) * 2021-10-29 2022-02-01 深圳市商汤科技有限公司 Intelligent agent training method and device, computer equipment and storage medium
CN114723065A (en) * 2022-03-22 2022-07-08 中国人民解放军国防科技大学 Optimal strategy acquisition method and device based on two-layer deep reinforcement learning model
CN114662639A (en) * 2022-03-24 2022-06-24 河海大学 A multi-agent reinforcement learning method and system based on value decomposition

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116922397A (en) * 2023-09-13 2023-10-24 成都明途科技有限公司 Robot intelligent level measuring method and device, robot and storage medium
CN116922397B (en) * 2023-09-13 2023-11-28 成都明途科技有限公司 Robot intelligent level measuring method and device, robot and storage medium
CN117933096A (en) * 2024-03-21 2024-04-26 山东省科学院自动化研究所 A method and system for generating unmanned driving confrontation test scenarios

Also Published As

Publication number Publication date
CN115648204B (en) 2024-08-27


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant