CN117236416A - Large language model interaction method and device - Google Patents

Large language model interaction method and device

Info

Publication number: CN117236416A
Authority: CN (China)
Prior art keywords: planner, coordinator, large language model, executor
Application number: CN202311498497.XA
Other languages: Chinese (zh)
Inventors: 胡斌, 赵晨阳, 张璞, 周子豪, 刘斌
Current Assignee: Zhejiang Lab
Original Assignee: Zhejiang Lab
Priority date: 2023-11-13
Filing date: 2023-11-13
Publication date: 2023-12-15
Legal status: Pending

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a large language model interaction method and device. The method provides a new planner-coordinator-executor interaction framework for large language models, in which the large language model serves as the planner and an intelligent agent serves as the executor. The newly added coordinator determines when to request communication with the planner and converts the executor's current observation data into natural-language text strings that the planner can understand, and the coordinator can be pre-trained through reinforcement learning with an invalid-communication penalty to learn an optimal communication strategy. By applying the optimal communication strategy after formal deployment in a test environment, the invention significantly reduces the number of communications with the planner; at the same time, the coordinator reduces dependence on the planner in scenarios where the planner is error-prone and asks the planner for help in time when facing emergencies, thereby improving the safety and task success rate of the executor.

Description

Large language model interaction method and device
Technical Field
The invention relates to the field of reinforcement learning and natural language processing, in particular to a large language model interaction method and device.
Background
A large language model is an artificial intelligence model that aims to understand and generate human language. Such models are trained on large amounts of text data and contain billions of parameters, which allows them to learn complex patterns in language data. Large language models have achieved great success in recent years; ChatGPT, released by OpenAI, has attracted wide attention from all communities.
Some research uses the knowledge and reasoning capabilities of a large language model to assist an agent's decision-making and planning, but how the agent should communicate reasonably and efficiently with the large language model while completing a task remains an unresolved open problem. For example, although the SayCan method proposed by the Google team solves the problem of controlling robotic-arm motion better than conventional methods by relying on a large language model, it requires the agent to communicate with the large language model at every moment. Because the large language model contains billions of parameters, each communication costs the agent considerable time and computing resources, so if the agent communicates with the large language model at every step of task execution, the overhead is very large. In addition, when an agent encounters an unexpected situation, failing to consult the large language model in time may cause safety problems; for example, while an agent performs the task of "fetching a cup of water from the adjacent room and returning to this room," a gust of wind may accidentally blow the door shut, and if the robot continues to move forward, both the robot and the door will be damaged. Likewise, when the large language model makes an error, the task cannot be completed without a good error-correction mechanism, and safety problems may even occur.
A common agent system guided by a large language model divides the whole control process into a planner, based on the large language model and dedicated to providing high-level instructions at the logical level, and an executor, based on pre-training or presetting and dedicated to low-level motion control. On the basis of this original framework, the invention adds a coordinator as an intermediary between the planner and the executor to judge whether communication with the planner is needed. The coordinator uses reinforcement learning to maximize a cumulative communication reward, i.e., to have the agent complete its task with the minimum number of communications, thereby solving the above-mentioned problem of communication between the agent (executor) and the large language model (planner).
Disclosure of Invention
The invention aims to provide a large language model interaction method and device aiming at the defects in the prior art.
The aim of the invention is realized by the following technical scheme: the first aspect of the embodiment of the invention provides a large language model interaction method, which comprises the following steps:
(1) After the executor interacts with the environment, the currently acquired observation data are sent to a coordinator;
(2) The coordinator judges whether the coordinator needs to communicate with the planner or not by adopting an optimal communication strategy according to the received observation data, and if the coordinator needs to communicate with the planner, the coordinator converts the observation data into standard form data and sends the standard form data to the planner; if the coordinator does not need to communicate with the planner, the coordinator resends the current high-level instruction to the executor and jumps to step (4);
(3) After receiving the standard form data, the planner generates a new high-level instruction corresponding to the execution action based on the standard form data and sends the new high-level instruction to the executor;
(4) After receiving the high-level instruction, the executor calls the underlying control logic corresponding to the high-level instruction according to the current observation data to execute the corresponding execution action.
Further, the observation data includes sensor data, text data, and image data.
Further, the optimal communication strategy specifically includes: the coordinator decides at each moment whether to communicate with the planner so that the task can be completed with the minimum number of communications, and this process is defined as a reinforcement learning process; the reinforcement learning process specifically includes: in the state at each moment, the coordinator has two different actions, namely insisting on executing the current plan or requesting a new plan from the planner, and an invalid communication penalty is introduced on the basis of the reward given by the environment to form the cumulative communication reward; wherein the state is the observation data collected by the executor, and the optimal communication strategy of the coordinator is obtained by maximizing the cumulative communication reward.
Further, the expression of the cumulative communication reward is:

$$R_t=\sum_{k=0}^{\infty}\gamma^{k}\Big(r_{t+k}-\lambda\cdot\mathbb{1}\big(a_{t+k}=\mathrm{ask}\ \wedge\ g_{t+k}=g_{t+k-1}\big)\Big)$$

wherein $R_t$ is the cumulative communication reward at time t, $r_t$ is the reward given by the environment at time t, $\mathbb{1}(\cdot)$ is the characteristic (indicator) function, $a_t$ is the action of the coordinator at time t, ask indicates that the coordinator needs to communicate with the planner and not ask indicates that the coordinator does not need to communicate with the planner, $g_t$ is the high-level instruction returned by the planner at time t, $\gamma$ is the reward discount coefficient, and $\lambda$ is the penalty coefficient for invalid communication.
Further, the training method for maximizing the cumulative communication reward includes the proximal policy optimization method, the maximum-entropy soft actor-critic method, the deep Q-network, and the advantage actor-critic method.
Further, the executor includes a plurality of agents; the planner includes a large language model, or a user cooperating with the large language model.
Further, the high-level instructions are in one-to-one correspondence with the underlying control logic, the underlying control logic is in one-to-one correspondence with the execution actions, and the high-level instructions are in one-to-one correspondence with the execution actions.
A second aspect of the embodiment of the present invention provides a large language model interaction device, configured to implement the foregoing large language model interaction method, including:
the planner module is used for generating a new high-level instruction corresponding to the execution action according to the received standard form data;
the coordinator module is used for judging whether the coordinator needs to communicate with the planner or not by adopting an optimal communication strategy according to the observation data, and if the coordinator needs to communicate with the planner, the coordinator converts the observation data into standard form data and sends the standard form data to the planner; if the coordinator does not need to communicate with the planner, the coordinator resends the current high-level instruction to the executor; and
and the executor module is used for collecting the observation data and, after receiving the high-level instruction, calling the underlying control logic corresponding to the high-level instruction to execute the corresponding execution action.
A third aspect of an embodiment of the invention provides an electronic device comprising a memory and a processor, the memory coupled to the processor; the memory is used for storing program data, and the processor is used for executing the program data to realize the large language model interaction method.
A fourth aspect of the embodiments of the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the above-described large language model interaction method.
The invention has the beneficial effects that, by providing a new interaction framework and adding a coordinator as an intermediary connecting the large language model and the agent, the time cost and computing-resource cost of communication are effectively reduced; at the same time, after the coordinator is introduced, the agent can more easily turn to the large language model in time when facing an emergency, dependence on the large language model is reduced in scenarios where the large language model is error-prone, and the safety and task success rate of the agent are improved.
Drawings
FIG. 1 is a schematic flow diagram of reinforcement learning;
FIG. 2 is a general flow diagram of a large language model interaction method;
FIG. 3 is a communication schematic diagram of one embodiment of a single agent system employing a large language model interaction method;
FIG. 4 is a communication diagram of one embodiment of a multi-agent system employing a large language model interaction method;
FIG. 5 is a communication schematic diagram of another embodiment of a multi-agent system employing a large language model interaction method;
FIG. 6 is a communication diagram of another embodiment of a multi-agent system employing a large language model interaction method;
FIG. 7 is a schematic diagram of experimental verification results of the effectiveness of the large language model interaction method in a MiniGrid environment simulating a smart factory; FIG. 7 (a) shows the experimental verification result for task success rate, and FIG. 7 (b) shows the experimental verification result for communication cost;
FIG. 8 is a schematic diagram of a large language model interaction device.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the invention. Rather, they are merely examples of apparatus and methods consistent with aspects of the invention as detailed in the accompanying claims.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used herein to describe various information, these information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the invention. The word "if" as used herein may be interpreted as "at … …" or "at … …" or "responsive to a determination", depending on the context.
The present invention will be described in detail with reference to the accompanying drawings. The features of the examples and embodiments described below may be combined with each other without conflict.
Referring to fig. 2, the large language model interaction method of the present invention specifically includes the following steps:
(1) After the executor interacts with the environment, the currently acquired observation data are sent to the coordinator.
Further, the observed data includes, but is not limited to: sensor data, text data, image data, etc.
It should be noted that the executor includes agents such as robots or traffic lights, which can perform certain actions based on pre-trained or preset underlying motion control to complete the task to be executed. It should be appreciated that, when performing a task in a system, the executor may be a single agent, such as a robot or a traffic light; the executor may also consist of multiple agents, such as multiple signal lights, roadside monitoring devices, and autonomous vehicles.
Specifically, when an executor performs a task, for example when a robot performs the task of passing through a door, the robot collects current observation data through its various sensors, onboard image acquisition devices, and the like while moving, that is, after interacting with the environment; the robot sends the collected observation data to the coordinator, and the coordinator can then determine, according to the observation data, whether the robot should pass through the door directly or request a new plan.
(2) The coordinator judges whether the coordinator needs to communicate with the planner or not by adopting an optimal communication strategy according to the received observation data, and if the coordinator needs to communicate with the planner, the coordinator converts the observation data into standard form data and sends the standard form data to the planner; if the coordinator does not need to communicate with the planner, the coordinator resends the current high-level instruction to the executor and jumps to step (4).
It should be noted that the standard form data is natural language that can be understood by the planner; the observation data can be converted into standard form data, i.e., natural language the planner can understand, through the YOLO object detection algorithm or the CLIP multi-modal model. The standard form data can be text strings or pictures; for example, the large language model GPT-4 supports both picture input and text input.
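As an illustration of this conversion step, the following is a minimal sketch; the detection schema, field names, and template sentence are assumptions made for this example, and in practice the detections could come from an off-the-shelf detector such as YOLO or an image-text model such as CLIP:

```python
def observations_to_text(detections):
    """Convert structured perception results into a natural-language string for the planner.

    `detections` is assumed to be a list of dicts such as
    {"label": "door", "state": "closed", "distance_m": 1.2} produced by an
    upstream perception module (e.g. a YOLO-style detector); the exact schema
    is an assumption made for this sketch.
    """
    if not detections:
        return "Nothing notable is observed in the current field of view."
    parts = []
    for det in detections:
        desc = det["label"]
        if det.get("state"):
            desc = f'{det["state"]} {desc}'
        if det.get("distance_m") is not None:
            desc += f' about {det["distance_m"]:.1f} meters ahead'
        parts.append(desc)
    return "The agent observes: " + "; ".join(parts) + "."

# Example: a closed door 1.2 m ahead becomes a sentence the planner can read.
print(observations_to_text([{"label": "door", "state": "closed", "distance_m": 1.2}]))
```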
In this embodiment, the optimal communication strategy specifically includes: the coordinator decides at each moment whether to communicate with the planner so that the task can be completed with the minimum number of communications, and this process is defined as a reinforcement learning process; the reinforcement learning process specifically includes: in the state at each moment, the coordinator has two different actions, namely insisting on executing the current plan or requesting a new plan from the planner, and an invalid communication penalty is introduced on the basis of the reward given by the environment to form the cumulative communication reward; wherein the state is the observation data collected by the executor, and the optimal communication strategy of the coordinator is obtained by maximizing the cumulative communication reward.
It should be appreciated that reinforcement learning is a sub-area of machine learning concerned with how an agent should act in an environment to maximize its cumulative reward. Reinforcement learning mainly involves an agent, an environment, states, actions, and rewards; as shown in FIG. 1, after the agent performs an action, the environment transitions to a new state and gives a reward for that new state, and the agent then performs a new action according to a preset policy based on the new state and the reward fed back by the environment.
Further, the reward given by the environment reflects the completion of the task, and on this basis an invalid communication penalty is introduced. The invalid communication penalty specifically includes: when the coordinator decides to communicate with the planner and the high-level instruction returned by the planner at the current moment is the same as the high-level instruction returned at the previous moment, the communication was unnecessary, and the coordinator receives negative feedback.
Further, the expression of the cumulative communication reward is:

$$R_t=\sum_{k=0}^{\infty}\gamma^{k}\Big(r_{t+k}-\lambda\cdot\mathbb{1}\big(a_{t+k}=\mathrm{ask}\ \wedge\ g_{t+k}=g_{t+k-1}\big)\Big)$$

wherein $R_t$ is the cumulative communication reward at time t, $r_t$ is the reward given by the environment at time t, $\mathbb{1}(\cdot)$ is the characteristic (indicator) function, $a_t$ is the action of the coordinator at time t, ask indicates that the coordinator needs to communicate with the planner and not ask indicates that the coordinator does not need to communicate with the planner, $g_t$ is the high-level instruction returned by the planner at time t, $\gamma$ is the reward discount coefficient, and $\lambda$ is the penalty coefficient for invalid communication.
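The per-step shaped reward and the discounted return can be computed as in the sketch below; the variable names and the explicit discounting loop are assumptions consistent with the definitions above, not a reference implementation from the patent:

```python
ASK, NOT_ASK = 1, 0

def shaped_reward(env_reward, action, new_instruction, prev_instruction, penalty=0.1):
    """Environment reward minus the invalid-communication penalty: asking the planner
    and receiving the same high-level instruction as at the previous moment is penalized."""
    invalid = (action == ASK) and (new_instruction == prev_instruction)
    return env_reward - (penalty if invalid else 0.0)

def discounted_return(shaped_rewards, gamma=0.99):
    """Cumulative communication reward: discounted sum of shaped rewards over an episode."""
    ret = 0.0
    for r in reversed(shaped_rewards):
        ret = r + gamma * ret
    return ret
```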
It should be appreciated that when the coordinator's action is "communicate", the coordinator requests a new plan from the planner; when the coordinator's action is "do not communicate", the coordinator insists on executing the current plan.
Further, the optimal communication strategy of the coordinator can be obtained by maximizing the cumulative communication reward through classical reinforcement learning methods such as the proximal policy optimization method (PPO), the maximum-entropy soft actor-critic method (SAC), the deep Q-network (DQN), or the advantage actor-critic method (A2C). After the system is formally deployed in a test environment, the coordinator receives the state (i.e., observation data) sent by the executor and, by invoking the optimal communication strategy, outputs the action of whether or not to communicate with the planner. On the one hand, this significantly reduces the number of communications with the large language model and thus the time cost and computing-resource cost of communication; on the other hand, it makes it easier for the agent to turn to the large language model in time when facing an emergency and reduces dependence on the large language model in scenarios where it is error-prone, improving the safety and task success rate of the agent.
It should be appreciated that the proximal policy optimization method is a classical reinforcement learning method that performs mini-batch updates over multiple training steps, which makes the update step size easy to control; a behavior is selected based on the observation data and backpropagation is performed, with the reward directly strengthening or weakening the probability of selecting that behavior, so that good behaviors become more likely to be selected next time and bad behaviors less likely. The maximum-entropy soft actor-critic method is a classical reinforcement learning method in which the actor and the critic cooperate to guide the agent toward optimal behaviors; the actor simultaneously maximizes the expected return and the entropy of the policy distribution, ensuring randomness of the behavior policy while selecting optimal behaviors, which makes the method efficient, stable, and robust across different environments.
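For concreteness, the coordinator's policy could be trained with an off-the-shelf PPO implementation roughly as sketched below; the Gymnasium environment wrapper, the `executor`/`planner` interfaces, and all hyperparameters are illustrative assumptions, and the patent does not prescribe any particular library:

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces
from stable_baselines3 import PPO

class CoordinatorEnv(gym.Env):
    """Wraps the executor/planner loop so that the coordinator's ask / not-ask decision
    is the RL action and the executor's observation is the RL state."""

    def __init__(self, executor, planner, penalty=0.1):
        super().__init__()
        self.executor, self.planner, self.penalty = executor, planner, penalty
        self.observation_space = spaces.Box(0.0, 1.0, shape=(executor.obs_dim,), dtype=np.float32)
        self.action_space = spaces.Discrete(2)  # 0 = keep current plan, 1 = ask the planner
        self.instruction = None

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        obs = self.executor.reset()
        self.instruction = self.planner.plan(self.executor.observe_text())  # initial high-level instruction
        return obs, {}

    def step(self, action):
        prev = self.instruction
        if action == 1:  # ask: convert observations to text and query the planner
            self.instruction = self.planner.plan(self.executor.observe_text())
        obs, env_reward, done = self.executor.act(self.instruction)
        invalid = (action == 1) and (self.instruction == prev)
        reward = env_reward - (self.penalty if invalid else 0.0)
        return obs, reward, done, False, {}

# Training sketch (executor and planner are hypothetical stubs providing the methods used above):
# model = PPO("MlpPolicy", CoordinatorEnv(executor, planner), verbose=1)
# model.learn(total_timesteps=200_000)
```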
(3) After receiving the standard form data, the planner generates a new high-level instruction corresponding to the execution action based on the standard form data and sends the new high-level instruction to the executor.
The planner includes, but is not limited to, a large language model, or a user cooperating with the large language model. It should be understood that the planner may be a large language model, with high-level instructions output by encoding and decoding natural language through the neural network; a technical expert may also intervene manually, such as traffic police in an intelligent traffic system; or alternatives may be given by a large language model and selected by the user according to preferences.
Specifically, when the standard form data received by the planner is "milk is observed to be spilled on the table", it can be determined based on the standard form data that the action to be executed by the executor is "control the robotic arm to wipe the table clean"; the planner generates a high-level instruction corresponding to this action and then sends the high-level instruction to the executor.
(4) After receiving the high-level instruction, the executor calls the underlying control logic corresponding to the high-level instruction according to the current observation data to execute the corresponding execution action.
Specifically, the executor has its own corresponding underlying control logic; the underlying control logic corresponds one-to-one to the high-level instructions, the underlying control logic corresponds one-to-one to the execution actions, and the high-level instructions correspond one-to-one to the execution actions. Therefore, after receiving a high-level instruction, the executor calls the underlying control logic corresponding to that high-level instruction and executes the corresponding execution action. The underlying control logic may be pre-trained or preset. After receiving a new high-level instruction sent by the planner, the executor executes the new execution action corresponding to the new high-level instruction; after receiving the current high-level instruction resent by the coordinator, the executor continues to execute the pre-trained or preset execution action corresponding to the current high-level instruction. A minimal sketch of the complete interaction loop of steps (1)-(4) is given below.
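In this sketch, the skill table, object interfaces, and function names are illustrative assumptions rather than the patent's reference implementation:

```python
# Hypothetical one-to-one mapping from high-level instructions (natural-language strings)
# to underlying control logic (pre-trained or preset motion primitives).
SKILLS = {
    "move forward": "forward_motion_primitive",
    "turn left": "left_turn_primitive",
    "open door": "door_opening_primitive",
}

def run_episode(executor, coordinator, planner, max_steps=100):
    """One deployment episode: the coordinator decides when to consult the planner."""
    obs = executor.reset()                                     # step (1): interact with the environment
    instruction = planner.plan(coordinator.to_text(obs))       # initial high-level instruction
    for _ in range(max_steps):
        if coordinator.should_ask(obs):                        # step (2): optimal communication strategy
            instruction = planner.plan(coordinator.to_text(obs))   # step (3): new high-level instruction
        obs, done = executor.execute(SKILLS[instruction], obs)     # step (4): underlying control logic
        if done:
            break
```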
The objects and effects of the present invention will become more apparent by describing in detail a large language model interaction method of the present invention according to an embodiment.
Example 1
FIG. 3 shows a communication schematic diagram of one embodiment in which the large language model interaction method is applied in a single-agent system.
Specifically, take a certain working state of a transfer robot in a smart factory as an example (practical application scenarios include but are not limited to transfer robots); the transfer robot is the executor and the large language model is the planner. The large language model interaction method applied to the single-agent system includes the following steps:
(1) The transfer robot transmits observation data such as image data acquired by the camera to the coordinator on the way to the next object to be transferred.
(2) The coordinator receives the observation data sent by the transfer robot and uses the optimal communication strategy to judge whether communication with the planner is needed. If the image acquired by the transfer robot's camera contains no object to be transferred, or the object to be transferred is still far away, the coordinator does not need to communicate with the large language model; the coordinator then resends the current high-level instruction to the transfer robot so that the transfer robot continues to execute the originally preset moving action. If the transfer robot is close to the object to be transferred, the coordinator needs to communicate with the planner; it converts the observation data, such as the image data collected by the transfer robot's camera, into standard form data, i.e., natural language that the large language model can understand, for example "the object is XX high and XX wide, and its position relative to the transfer robot is XX", and outputs this data to the large language model.
(3) After the large language model receives the standard form data, it generates a new high-level instruction corresponding to the execution action based on the standard form data, i.e., a high-level instruction in natural-language form such as "rotate by XX degrees to align with the object, extend both hands forward by XX distance at a spacing of XX and grasp the object, lift the object by XX height, and then retract both hands to the front", and then issues the new high-level instruction to the transfer robot.
(4) After receiving the high-level instruction, the transfer robot invokes the pre-trained or preset underlying control logic, such as "rotate direction", "extend both hands", "lift up the object", and "retract both hands", corresponding to the high-level instruction in natural-language form, completing these instructions at the underlying mechanical control level and thereby completing the corresponding execution actions.
Example 2
FIG. 4 shows a communication diagram of one embodiment in which the large language model interaction method is applied in a multi-agent system.
Specifically, take a certain working state of an intelligent traffic system as an example (practical application scenarios include but are not limited to intelligent traffic systems); in this system, the executors include a plurality of signal lights, roadside monitoring devices, and autonomous vehicles, and the planner is a collaboration between a large language model and traffic police. The large language model interaction method applied to the multi-agent system includes the following steps:
(1) Several roadside monitoring devices and the onboard cameras of autonomous vehicles observe an accident on a certain road section and send the currently acquired observation data to the coordinator.
(2) The coordinator receives and aggregates the observation data sent by the plurality of executors and judges whether communication with the planner is needed. If the coordinator, after analyzing the observation data collected by the roadside monitoring devices and the autonomous vehicles, judges that the accident has little influence on the traffic situation of the road section, the coordinator does not need to communicate with the planner, and all executors can continue to act according to the original plan. If the coordinator judges from the observation data that the road section is blocked by the accident and the blocked traffic flow needs to be cleared or guided to surrounding idle road sections, the coordinator needs to communicate with the planner; it converts the observation data collected by the plurality of executors into standard form data, i.e., natural language that the planner can understand, for example "an accident has occurred in lane XX of road section XX, and XX motor vehicles are blocked", and outputs it to the large language model and the traffic police.
(3) After the large language model and the traffic police receive the standard form data, they generate new high-level instructions corresponding to the execution actions based on the standard form data and send to each executor in the multi-agent system the high-level instruction corresponding to the optimal action that executor needs to perform; for example, autonomous vehicles that have not yet entered the road section are instructed to turn ahead of it, autonomous vehicles already on the road section are instructed to avoid the accident lane ahead, the signal lights on the road section are instructed to keep the green light longer, and so on, so that the whole multi-agent system can reach a Nash equilibrium through jointly optimal actions.
(4) After each executor receives its corresponding high-level instruction, it calls the pre-trained or preset underlying control logic, such as "steer", "decelerate", "change lanes", and "switch the light scheme", corresponding to the high-level instruction in natural-language form, completing these instructions at the underlying mechanical control level and thereby completing the corresponding execution actions.
Example 3
As shown in fig. 5, a communication schematic diagram of another embodiment in which a multi-agent system applies the large language model interaction method is shown; in this embodiment the system includes a plurality of executors, one coordinator, and one planner, and the large language model interaction method specifically includes the following steps:
(1) In the current system, a plurality of executors are arranged, each executor transmits the observation data acquired by the executor to a coordinator, and the coordinator combines all the received observation data to obtain combined observation data.
(2) The coordinator judges whether the coordinator needs to communicate with the planner or not by adopting an optimal communication strategy according to the combined observation data, and if the coordinator does not need to communicate with the planner, the coordinator resends the current high-level instruction to the corresponding executor so that each executor can continuously execute the original high-level instruction; if the coordinator needs to communicate with the planner, the coordinator converts the combined observation data into standard form data, such as text strings or picture form observation data, and sends the standard form data to the planner.
(3) After receiving the standard form data, the planner generates a new high-level instruction corresponding to the execution action based on the standard form data and sends the new high-level instruction to the executor.
(4) After receiving the high-level instruction, the executor calls the underlying control logic corresponding to the high-level instruction according to the current observation data to execute the corresponding execution action.
Example 4
As shown in fig. 6, a communication schematic diagram of another embodiment in which a multi-agent system applies the large language model interaction method is shown; in this embodiment the system includes a plurality of executors, a plurality of coordinators, a plurality of sub-planners, and an overall planner, and the large language model interaction method specifically includes the following steps:
(1) Each executor transmits the observation data acquired by the executor to the corresponding coordinator.
(2) The coordinator judges whether to communicate with the sub-planners or not by adopting an optimal communication strategy according to the received observation data, and if the coordinator does not need to communicate with the sub-planners, the coordinator resends the current high-level instruction to the corresponding executor so that the executor can continue to execute the original high-level instruction; if the coordinator needs to communicate with the sub-planners, the coordinator converts the observed data into standard form data, such as text strings or picture form observed data, and sends the standard form data to the sub-planners.
(3) After each sub-planner receives the standard form data, it generates a new high-level instruction corresponding to the execution action based on the standard form data; each sub-planner then sends the generated new high-level instruction to the overall planner, and the overall planner, after integrating the high-level instructions sent by the sub-planners, makes a global decision, i.e., generates the corresponding high-level instruction for each executor and sends it to the corresponding executor.
(4) After each executor receives its high-level instruction, it calls, according to the current observation data, the underlying control logic corresponding to the high-level instruction to execute the corresponding execution action.
In summary, a task in the MiniGrid environment is used to simulate an agent in a smart-factory scenario; in this task the agent needs to explore the whole room with a limited field of view, find the key matching the color of the door, and use that key to open the door. As shown in fig. 7, the beneficial effects of the present invention were verified by comparing four methods: the method adopted by the present invention, a method in which the moments at which the agent communicates are manually set in advance, a method that always communicates with the large language model, and a method that never communicates with the large language model. The ordinate of fig. 7 (a) represents the task success rate, the ordinate of fig. 7 (b) represents the number of communications between the agent and the large language model, and the abscissa of both figures represents the number of experimental cycles. It can be seen that, with continued training, the communication cost of the method adopted by the present invention keeps decreasing and its task success rate keeps increasing, so that it finally outperforms the other methods in both communication cost and task success rate.
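The door-and-key task described here can be instantiated, for example, with the open-source MiniGrid benchmark; the environment ID and options below are standard MiniGrid settings given only as an illustrative stand-in for the simulated smart-factory setup used in the experiment:

```python
import gymnasium as gym
import minigrid  # registers the MiniGrid environments with Gymnasium

# A partially observable door-and-key task: find the key matching the door's color,
# open the door, and reach the goal, all with a limited egocentric field of view.
env = gym.make("MiniGrid-DoorKey-8x8-v0", render_mode="rgb_array")
obs, info = env.reset(seed=0)
# obs["image"] is the agent's limited egocentric view and obs["mission"] is a text goal
# (e.g. "use the key to open the door and then get to the goal").
print(obs["mission"])
```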
It is worth mentioning that the invention also provides a large language model interaction device.
Specifically, as shown in fig. 2, the apparatus includes a planner module, a coordinator module, and an executor module. The planner module is used for generating a new high-level instruction corresponding to the execution action according to the received standard form data. The coordinator module is used for judging, according to the observation data and using the optimal communication strategy, whether communication with the planner is needed; if communication with the planner is needed, the coordinator converts the observation data into standard form data and sends the standard form data to the planner, and if communication with the planner is not needed, the coordinator resends the current high-level instruction to the executor. The executor module is used for collecting observation data and, after receiving the high-level instruction, calling the underlying control logic corresponding to the high-level instruction to execute the corresponding execution action.
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. One typical implementation is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above devices are described as being functionally divided into various units, respectively. Of course, the functions of each element may be implemented in the same piece or pieces of software and/or hardware when implementing the present invention.
Referring to fig. 8, an electronic device according to an embodiment of the present invention includes a memory and a processor, where the memory is coupled to the processor; the memory is used for storing program data, and the processor is used for executing the large language model interaction method in the embodiment.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer program(s) stored thereon that when executed by a processor performs the large language model interaction method of the above-described embodiments.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article or apparatus that comprises the element.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
The embodiments of the present invention are described in a progressive manner, and the same and similar parts of the embodiments are all referred to each other, and each embodiment is mainly described in the differences from the other embodiments. In particular, for system embodiments, since they are substantially similar to method embodiments, the description is relatively simple, as relevant to see a section of the description of method embodiments.
The above embodiments are merely for illustrating the design concept and features of the present invention, and are intended to enable those skilled in the art to understand the content of the present invention and implement the same, the scope of the present invention is not limited to the above embodiments. Therefore, all equivalent changes or modifications according to the principles and design ideas of the present invention are within the scope of the present invention.

Claims (10)

1. A large language model interaction method, comprising the steps of:
(1) After the executor interacts with the environment, the currently acquired observation data are sent to a coordinator;
(2) The coordinator judges whether the coordinator needs to communicate with the planner or not by adopting an optimal communication strategy according to the received observation data, and if the coordinator needs to communicate with the planner, the coordinator converts the observation data into standard form data and sends the standard form data to the planner; if the coordinator does not need to communicate with the planner, the coordinator resends the current high-level instruction to the executor and jumps to step (4);
(3) After receiving the standard form data, the planner generates a new high-level instruction corresponding to the execution action based on the standard form data and sends the new high-level instruction to the executor;
(4) After receiving the high-level instruction, the executor calls the underlying control logic corresponding to the high-level instruction according to the current observation data to execute the corresponding execution action.
2. The large language model interaction method of claim 1, wherein the observation data includes sensor data, text data, and image data.
3. The large language model interaction method according to claim 1, wherein the optimal communication strategy specifically comprises: the coordinator decides at each moment whether to communicate with the planner so that the task can be completed with the minimum number of communications, and this process is defined as a reinforcement learning process; the reinforcement learning process specifically comprises: in the state at each moment, the coordinator has two different actions, namely insisting on executing the current plan or requesting a new plan from the planner, and an invalid communication penalty is introduced on the basis of the reward given by the environment to form the cumulative communication reward; wherein the state is the observation data collected by the executor, and the optimal communication strategy of the coordinator is obtained by maximizing the cumulative communication reward.
4. The large language model interaction method according to claim 3, wherein the expression of the cumulative communication reward is:

$$R_t=\sum_{k=0}^{\infty}\gamma^{k}\Big(r_{t+k}-\lambda\cdot\mathbb{1}\big(a_{t+k}=\mathrm{ask}\ \wedge\ g_{t+k}=g_{t+k-1}\big)\Big)$$

wherein $R_t$ is the cumulative communication reward at time t, $r_t$ is the reward given by the environment at time t, $\mathbb{1}(\cdot)$ is the characteristic (indicator) function, $a_t$ is the action of the coordinator at time t, ask indicates that the coordinator needs to communicate with the planner and not ask indicates that the coordinator does not need to communicate with the planner, $g_t$ is the high-level instruction returned by the planner at time t, $\gamma$ is the reward discount coefficient, and $\lambda$ is the penalty coefficient for invalid communication.
5. The large language model interaction method according to claim 3, wherein the training method for maximizing the cumulative communication reward comprises the proximal policy optimization method, the maximum-entropy soft actor-critic method, the deep Q-network, and the advantage actor-critic method.
6. The large language model interaction method of claim 1, wherein the executor comprises a plurality of agents; the planner comprises a large language model, or a user cooperating with the large language model.
7. The large language model interaction method of claim 1, wherein the high-level instructions are in one-to-one correspondence with the underlying control logic, the underlying control logic is in one-to-one correspondence with the execution actions, and the high-level instructions are in one-to-one correspondence with the execution actions.
8. A large language model interaction device for implementing the large language model interaction method of any one of claims 1 to 7, comprising:
the planner module is used for generating a new high-level instruction corresponding to the execution action according to the received standard form data;
the coordinator module is used for judging whether the coordinator needs to communicate with the planner or not by adopting an optimal communication strategy according to the observation data, and if the coordinator needs to communicate with the planner, the coordinator converts the observation data into standard form data and sends the standard form data to the planner; if the coordinator does not need to communicate with the planner, the coordinator resends the current high-level instruction to the executor; and
and the executor module is used for collecting the observation data and, after receiving the high-level instruction, calling the underlying control logic corresponding to the high-level instruction to execute the corresponding execution action.
9. An electronic device comprising a memory and a processor, wherein the memory is coupled to the processor; wherein the memory is for storing program data and the processor is for executing the program data to implement the large language model interaction method of any of claims 1-7.
10. A computer readable storage medium having stored thereon a computer program, wherein the program when executed by a processor implements the large language model interaction method of any of claims 1-7.
CN202311498497.XA 2023-11-13 2023-11-13 Large language model interaction method and device Pending CN117236416A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311498497.XA CN117236416A (en) 2023-11-13 2023-11-13 Large language model interaction method and device

Publications (1)

Publication Number Publication Date
CN117236416A true CN117236416A (en) 2023-12-15

Family

ID=89098640

Country Status (1)

Country Link
CN (1) CN117236416A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118092764A (en) * 2024-03-11 2024-05-28 北京邮电大学 Method and device for controlling actions of intelligent agent guided by large language model

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116841615A (en) * 2023-06-07 2023-10-03 福建天泉教育科技有限公司 Man-machine interaction method and system based on large language model
CN116661503A (en) * 2023-08-02 2023-08-29 中国人民解放军96901部队 Cluster track automatic planning method based on multi-agent safety reinforcement learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
BIN HU et al.: "Enabling Intelligent Interactions between an Agent and an LLM: A Reinforcement Learning Approach", arXiv, 2023


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination