CN117236416A - Large language model interaction method and device - Google Patents
Large language model interaction method and device
- Publication number
- CN117236416A (application CN202311498497.XA)
- Authority
- CN
- China
- Prior art keywords
- planner
- coordinator
- language model
- large language
- executor
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The invention discloses a large language model interaction method and device. The method provides a new planner-coordinator-executor interaction framework for large language models, in which the large language model serves as the planner and an agent serves as the executor. The newly added coordinator determines when to request communication with the planner and converts the executor's current observation data into natural-language text strings that the planner can understand; the coordinator is pre-trained by reinforcement learning with an invalid communication penalty so as to learn an optimal communication strategy. By applying this optimal communication strategy after formal deployment in a test environment, the invention significantly reduces the number of communications with the planner; at the same time, the coordinator reduces dependence on the planner in scenarios where the planner is error-prone and can turn to the planner in time in emergencies, thereby improving the executor's safety and task success rate.
Description
Technical Field
The invention relates to the field of reinforcement learning and natural language processing, in particular to a large language model interaction method and device.
Background
A large language model is an artificial intelligence model designed to understand and generate human language. Such models are trained on large amounts of text data and contain billions of parameters, allowing them to learn complex patterns in language data. Large language models have achieved great success in recent years; ChatGPT, released by OpenAI, has attracted wide attention from all communities.
Part of the existing research assists the decision-making and planning of an agent by means of the knowledge and reasoning capability of a large language model, but how to communicate with the large language model reasonably and efficiently while the agent completes its task remains an unresolved open problem. For example, although the SayCan method proposed by the Google team uses a large language model to control robotic-arm motion better than conventional methods, it requires the agent to communicate with the large language model at every moment; because the large language model contains billions of parameters, each communication costs the agent considerable time and computing resources, and if the agent communicates with the large language model at every step of task execution, the overhead is very large. In addition, when an agent encounters an unexpected situation, failing to turn to the large language model in time may lead to safety problems; for example, when the agent performs the task of "fetching a cup of water from the adjacent room and returning," a gust of wind may blow the door shut, and if the robot keeps moving forward, both the robot and the door will be damaged. When the large language model makes a mistake, without a good error-correction mechanism the task cannot be completed and safety problems may even occur.
A typical agent system guided by a large language model divides the overall control process into a planner, based on the large language model and responsible for giving high-level instructions at the logical level, and an executor, based on pre-training or presetting and responsible for low-level motion control. On top of this framework, the invention adds a coordinator as an intermediary between the planner and the executor to judge whether communication with the planner is needed. The coordinator uses reinforcement learning to maximize a cumulative communication reward, i.e., to let the agent complete its task with the minimum number of communications, thereby solving the above-mentioned problem of communication between the agent (executor) and the large language model (planner).
Disclosure of Invention
The invention aims to provide a large language model interaction method and device aiming at the defects in the prior art.
The aim of the invention is realized by the following technical scheme: the first aspect of the embodiment of the invention provides a large language model interaction method, which comprises the following steps:
(1) After the executor interacts with the environment, it sends the currently acquired observation data to a coordinator;
(2) The coordinator judges whether the coordinator needs to communicate with the planner or not by adopting an optimal communication strategy according to the received observation data, and if the coordinator needs to communicate with the planner, the coordinator converts the observation data into standard form data and sends the standard form data to the planner; if the coordinator does not need to communicate with the planner, the coordinator resends the current high-level instruction to the executor and jumps to step (4);
(3) After receiving the standard form data, the planner generates a new high-level instruction corresponding to the execution action based on the standard form data and sends the new high-level instruction to the executor;
(4) After receiving the high-level instruction, the executor calls the bottom control logic corresponding to the high-level instruction according to the current observation data to execute the corresponding execution action.
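For illustration only, the following minimal Python sketch shows how steps (1) to (4) compose into one control loop; every class and method name in it (executor.observe, coordinator.should_ask, planner.plan, and so on) is a hypothetical placeholder rather than part of the claimed implementation.

```python
# Illustrative sketch of the planner-coordinator-executor loop (steps 1-4).
# All object and method names are placeholders; concrete implementations are
# application-specific.

def interaction_loop(executor, coordinator, planner, env, max_steps=100):
    env.reset()
    instruction = planner.initial_instruction()  # hypothetical: first high-level instruction
    for _ in range(max_steps):
        # (1) the executor interacts with the environment and reports its observation
        observation = executor.observe(env)
        # (2) the coordinator decides, via its communication strategy, whether to ask the planner
        if coordinator.should_ask(observation):
            standard_form = coordinator.to_standard_form(observation)  # e.g. a text string
            # (3) the planner (large language model) returns a new high-level instruction
            instruction = planner.plan(standard_form)
        # otherwise the current high-level instruction is simply resent to the executor
        # (4) the executor invokes the underlying control logic for the instruction and acts
        done = executor.execute(instruction, observation)
        if done:
            break
```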
Further, the observation data includes sensor data, text data, and image data.
Further, the optimal communication strategy is specifically as follows: the coordinator decides at each moment whether to communicate with the planner, so that the task can be completed with the minimum number of communications with the planner, and this process is defined as a reinforcement learning process. The reinforcement learning process is specifically as follows: for the state at each moment, the coordinator has two different actions, namely persisting with the current plan or requesting a new plan from the planner, and an invalid communication penalty is introduced on top of the reward given by the environment to form the cumulative communication reward; the state is the observation data collected by the executor; the optimal communication strategy of the coordinator is obtained by maximizing the cumulative communication reward.
Further, the expression of the cumulative communication reward is:

$$R_t = \sum_{k=0}^{\infty} \gamma^{k}\left[\, r_{t+k} - \lambda \cdot \mathbb{1}\left(a_{t+k} = \text{ask} \ \wedge \ g_{t+k} = g_{t+k-1}\right)\right]$$

wherein $R_t$ is the cumulative communication reward at time $t$, $r_t$ is the reward given by the environment at time $t$, $\mathbb{1}(\cdot)$ is the indicator function, $a_t$ is the action of the coordinator at time $t$ (ask indicates that the coordinator needs to communicate with the planner, and not ask indicates that it does not), $g_t$ is the high-level instruction returned by the planner at time $t$, $\gamma$ is the reward discount coefficient, and $\lambda$ is the invalid communication penalty coefficient.
Further, the training method for maximizing the cumulative communication reward includes a proximal policy optimization method, a maximum-entropy actor-critic method, a deep Q-network, and an advantage actor-critic method.
Further, the executor includes a plurality of agents; the planner includes a large language model, or a collaboration between the large language model and a user.
Further, the high-level instructions are in one-to-one correspondence with the bottom layer control logic, the bottom layer control logic is in one-to-one correspondence with the execution actions, and the high-level instructions are in one-to-one correspondence with the execution actions.
A second aspect of the embodiment of the present invention provides a large language model interaction device, configured to implement the foregoing large language model interaction method, including:
the planner module is used for generating a new high-level instruction corresponding to the execution action according to the received standard form data;
the coordinator module is used for judging whether the coordinator needs to communicate with the planner or not by adopting an optimal communication strategy according to the observation data, and if the coordinator needs to communicate with the planner, the coordinator converts the observation data into standard form data and sends the standard form data to the planner; if the coordinator does not need to communicate with the planner, the coordinator resends the current high-level instruction to the executor; and
and the executor module is used for collecting the observation data and calling the bottom control logic corresponding to the high-level instruction to execute the corresponding execution action after receiving the high-level instruction.
A third aspect of an embodiment of the invention provides an electronic device comprising a memory and a processor, the memory coupled to the processor; the memory is used for storing program data, and the processor is used for executing the program data to realize the large language model interaction method.
A fourth aspect of the embodiments of the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the above-described large language model interaction method.
The invention has the beneficial effects that, by providing a new interaction framework and adding a coordinator as an intermediary connecting the large language model and the agent, the time cost and the computing-resource cost brought by communication are effectively reduced; meanwhile, after the coordinator is introduced, the agent can turn to the large language model in time when facing an emergency, its dependence on the large language model is reduced in scenarios where the large language model is error-prone, and the safety and task success rate of the agent are improved.
Drawings
FIG. 1 is a schematic flow diagram of reinforcement learning;
FIG. 2 is a general flow diagram of a large language model interaction method;
FIG. 3 is a communication schematic diagram of one embodiment of a single agent system employing a large language model interaction method;
FIG. 4 is a communication diagram of one embodiment of a multi-agent system employing a large language model interaction method;
FIG. 5 is a communication schematic diagram of another embodiment of a multi-agent system employing a large language model interaction method;
FIG. 6 is a communication diagram of another embodiment of a multi-agent system employing a large language model interaction method;
FIG. 7 is a schematic diagram of the experimental verification results of the effectiveness of the large language model interaction method in a MiniGrid environment simulating an intelligent factory; fig. 7 (a) is a schematic diagram of the experimental verification result for the task success rate of the large language model interaction method in the MiniGrid environment simulating an intelligent factory, and fig. 7 (b) is a schematic diagram of the experimental verification result for the communication cost of the large language model interaction method in the MiniGrid environment simulating an intelligent factory;
FIG. 8 is a schematic diagram of a large language model interaction device.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the invention. Rather, they are merely examples of apparatus and methods consistent with aspects of the invention as detailed in the accompanying claims.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the invention. The word "if" as used herein may be interpreted as "when" or "upon" or "in response to determining", depending on the context.
The present invention will be described in detail with reference to the accompanying drawings. The features of the examples and embodiments described below may be combined with each other without conflict.
Referring to fig. 2, the large language model interaction method of the present invention specifically includes the following steps:
(1) After the executor interacts with the environment, the current acquired observation data is sent to the coordinator.
Further, the observed data includes, but is not limited to: sensor data, text data, image data, etc.
It should be noted that the executor includes one or more agents, such as robots or traffic lights, which can perform certain actions based on pre-trained or preset underlying motion control to complete the task to be executed. It should be appreciated that, when performing a task in a system, the executor may be a single agent, such as a robot or a traffic light, or multiple agents, such as multiple traffic lights, roadside monitoring devices, and autonomous vehicles.
Specifically, when an executor performs a task, for example when a robot performs the task of passing through a door, the robot collects current observation data through its various sensors, onboard image-acquisition devices and the like while moving, i.e., after interacting with the environment; the robot sends the collected observation data to the coordinator, and the coordinator can then determine from the observation data whether the robot can proceed directly through the door or a new plan is required.
(2) The coordinator judges whether the coordinator needs to communicate with the planner or not by adopting an optimal communication strategy according to the received observation data, and if the coordinator needs to communicate with the planner, the coordinator converts the observation data into standard form data and sends the standard form data to the planner; if the coordinator does not need to communicate with the planner, the coordinator resends the current high-level instruction to the executor and jumps to step (4).
It should be noted that the standard form data are natural language that the planner can understand; the observation data can be converted into the standard form data, i.e., natural language understandable by the planner, through the YOLO object detection algorithm or the CLIP multi-modal algorithm. The standard form data can be a text string or a picture; for example, the large language model GPT-4 supports both picture input and text input.
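As an illustrative assumption of how such a conversion might look, the sketch below turns generic detector output into a natural-language observation string; the detector call itself is abstracted away, and the tuple format, confidence threshold, and wording are hypothetical choices rather than the claimed conversion.

```python
# Sketch: turning object-detection output into "standard form" text the planner can read.
# The detector (e.g. a YOLO model) is abstracted away; `detections` is assumed to be a
# list of (label, confidence, (x, y, w, h)) tuples in image coordinates.
def to_standard_form(detections, robot_pose):
    parts = []
    for label, conf, (x, y, w, h) in detections:
        if conf < 0.5:  # illustrative confidence threshold
            continue
        parts.append(f"a {label} of width {w}px and height {h}px at image position ({x}, {y})")
    scene = "; ".join(parts) if parts else "no relevant objects"
    return f"The robot at pose {robot_pose} observes: {scene}."

# Example usage with a hypothetical detection:
print(to_standard_form([("door", 0.92, (120, 40, 80, 200))], robot_pose=(3.5, 1.0, 90)))
```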
In this embodiment, the optimal communication policy is specifically as follows: the coordinator decides at each moment whether to communicate with the planner, so that the task can be completed with the minimum number of communications with the planner, and this process is defined as a reinforcement learning process. The reinforcement learning process is specifically as follows: for the state at each moment, the coordinator has two different actions, namely persisting with the current plan or requesting a new plan from the planner, and an invalid communication penalty is introduced on top of the reward given by the environment to form the cumulative communication reward; the state is the observation data collected by the executor; the optimal communication strategy of the coordinator is obtained by maximizing the cumulative communication reward.
It should be appreciated that reinforcement learning is a sub-field of machine learning concerned with how an agent should act in an environment so as to obtain the maximum cumulative reward. Reinforcement learning mainly involves an agent, an environment, states, actions, and rewards; as shown in fig. 1, after the agent performs an action, the environment transitions to a new state and gives a reward for it, and the agent then takes a new action, according to a preset strategy, based on the new state and the reward fed back by the environment.
Further, the reward given by the environment reflects the completion of the task; on this basis, an invalid communication penalty is introduced, which is specifically as follows: when the coordinator decides to communicate with the planner and the high-level instruction returned by the planner at the current moment is the same as the high-level instruction returned at the previous moment, the communication was unnecessary, and the coordinator receives negative feedback.
Further, the expression of the cumulative communication reward is:

$$R_t = \sum_{k=0}^{\infty} \gamma^{k}\left[\, r_{t+k} - \lambda \cdot \mathbb{1}\left(a_{t+k} = \text{ask} \ \wedge \ g_{t+k} = g_{t+k-1}\right)\right]$$

wherein $R_t$ is the cumulative communication reward at time $t$, $r_t$ is the reward given by the environment at time $t$, $\mathbb{1}(\cdot)$ is the indicator function, $a_t$ is the action of the coordinator at time $t$ (ask indicates that the coordinator needs to communicate with the planner, and not ask indicates that it does not), $g_t$ is the high-level instruction returned by the planner at time $t$, $\gamma$ is the reward discount coefficient, and $\lambda$ is the invalid communication penalty coefficient.
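As a non-limiting sketch of the reward just defined, the per-step penalized reward and its discounted accumulation could be computed as follows; the function names and the example penalty coefficient are illustrative assumptions, not part of the claimed method.

```python
def penalized_reward(env_reward, action, new_instruction, previous_instruction,
                     penalty_coeff=0.1):
    """Per-step communication reward: the environment reward minus an invalid-
    communication penalty when the coordinator asks but the planner returns the
    same high-level instruction as at the previous step."""
    invalid_ask = (action == "ask") and (new_instruction == previous_instruction)
    return env_reward - penalty_coeff * float(invalid_ask)


def cumulative_reward(rewards, discount=0.99):
    """Discounted sum of per-step penalized rewards from time t onward."""
    total = 0.0
    for k, r in enumerate(rewards):
        total += (discount ** k) * r
    return total
```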
It should be appreciated that when the coordinator's action is "ask," the coordinator requests a new plan from the planner; when the coordinator's action is "not ask," the coordinator persists with executing the current plan.
Further, the optimal communication strategy of the coordinator can be obtained by maximizing the cumulative communication reward with classical reinforcement learning methods such as proximal policy optimization (PPO), soft actor-critic (SAC), deep Q-network (DQN), or advantage actor-critic (A2C). After the system is formally deployed in a test environment, the coordinator receives the state (i.e., the observation data) sent by the executor and, by invoking the optimal communication strategy, outputs the action of whether to communicate with the planner. On the one hand, this significantly reduces the number of communications with the large language model and thus the time cost and computing-resource cost of communication; on the other hand, it makes it easier for the agent to turn to the large language model in time when facing an emergency, reduces dependence on the large language model in scenarios where it is error-prone, and improves the safety and task success rate of the agent.
It should be appreciated that proximal policy optimization is a classical reinforcement learning method that performs small-batch updates over multiple training steps, so that a step size can be determined easily; a behavior is selected from the observation data for back-propagation, and the reward directly strengthens or weakens the probability of selecting that behavior, so that good behaviors become more likely to be selected next time and bad behaviors less likely. Soft actor-critic is a classical reinforcement learning method in which an actor and a critic cooperate to guide the agent toward optimal behaviors; the actor simultaneously maximizes the expected return and the entropy of the policy distribution, ensuring randomness of the behavior policy while selecting optimal behaviors, which makes the method efficient, stable, and more robust across different environments.
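The following sketch illustrates one possible way to pre-train the coordinator's ask/not-ask policy with proximal policy optimization, assuming the Gymnasium and stable-baselines3 libraries; the stub environment, observation shape, and time-step budget are illustrative assumptions and not part of the invention. Any of the other methods listed above (SAC, DQN, A2C) could be substituted for PPO in the same way.

```python
# Sketch: training the coordinator's ask/not-ask policy with PPO via stable-baselines3
# on a Gymnasium-style environment. The environment below is a stub; a real one would
# wrap the executor, the planner, and the task environment, and return the penalized
# communication reward defined above.
import numpy as np
import gymnasium as gym
from gymnasium import spaces
from stable_baselines3 import PPO

class CoordinatorEnv(gym.Env):
    """Stub environment: observations stand in for the executor's encoded observations,
    actions are 0 = keep the current plan, 1 = ask the planner for a new plan."""
    def __init__(self):
        super().__init__()
        self.observation_space = spaces.Box(-1.0, 1.0, shape=(16,), dtype=np.float32)
        self.action_space = spaces.Discrete(2)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        return self.observation_space.sample(), {}

    def step(self, action):
        obs = self.observation_space.sample()
        reward = 0.0  # would be the penalized communication reward in a real setup
        terminated, truncated = False, False
        return obs, reward, terminated, truncated, {}

model = PPO("MlpPolicy", CoordinatorEnv(), verbose=0)
model.learn(total_timesteps=10_000)  # pre-training phase
action, _ = model.predict(np.zeros(16, dtype=np.float32), deterministic=True)
```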
(3) After receiving the standard form data, the planner generates a new high-level instruction corresponding to the execution action based on the standard form data and sends the new high-level instruction to the executor.
The planner includes, but is not limited to, a large language model, or a large language model in cooperation with a user. It should be understood that the planner may be a large language model that encodes and decodes natural language through a neural network to output high-level instructions; a technical expert may also intervene manually, such as a traffic police officer in an intelligent traffic system; or the large language model may give alternatives from which the user selects according to preferences.
Specifically, when the standard form data received by the planner is "milk is observed to be spilled on the desktop," it can be determined from these data that the action to be executed by the executor is "control the mechanical arm to wipe the desktop clean"; the planner generates the high-level instruction corresponding to this action and sends it to the executor.
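For illustration, the planner step could be realized by prompting a large language model as sketched below, assuming an OpenAI-style chat-completion client; the skill list, prompt wording, and model name are illustrative assumptions, and any other large language model interface could be substituted.

```python
# Sketch of the planner step: prompting a large language model with the standard-form
# observation and a fixed set of admissible high-level instructions. Assumes an
# OpenAI-style chat-completion client; the skill names are hypothetical.
from openai import OpenAI

SKILLS = ["wipe the desktop", "fetch an object", "open the door", "move forward", "stop"]

def plan(standard_form_observation: str) -> str:
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    prompt = (
        f"Observation: {standard_form_observation}\n"
        f"Choose exactly one high-level instruction from: {', '.join(SKILLS)}.\n"
        "Answer with the instruction only."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()

# e.g. plan("Milk is spilled on the desktop.") would be expected to return "wipe the desktop"
```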
(4) After receiving the high-level instruction, the executor calls the bottom control logic corresponding to the high-level instruction according to the current observation data to execute the corresponding execution action.
Specifically, the executor has its own corresponding underlying control logic; the underlying control logic corresponds one-to-one to the high-level instructions, the underlying control logic corresponds one-to-one to the execution actions, and the high-level instructions correspond one-to-one to the execution actions. Therefore, after receiving a high-level instruction, the executor calls the underlying control logic corresponding to that instruction and performs the corresponding execution action. The underlying control logic may be pre-trained or preset. After receiving a new high-level instruction sent by the planner, the executor performs the new execution action corresponding to it; after receiving the current high-level instruction resent by the coordinator, the executor continues to perform the pre-trained or preset execution action corresponding to the current high-level instruction.
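A minimal sketch of this one-to-one dispatch, with hypothetical skill routines standing in for real motion controllers and the observation assumed to be a dictionary, might look as follows.

```python
# Sketch of the executor's dispatch: high-level instructions map one-to-one to
# pre-trained or preset low-level control routines. The skill functions here are
# placeholders for real motion controllers.
def rotate_to(angle, obs): ...
def wipe_surface(obs): ...

CONTROL_LOGIC = {
    "wipe the desktop": lambda obs: wipe_surface(obs),
    "turn to face the object": lambda obs: rotate_to(obs.get("target_angle", 0), obs),
}

def execute(instruction, observation):
    skill = CONTROL_LOGIC.get(instruction)
    if skill is None:
        raise ValueError(f"no underlying control logic for instruction: {instruction!r}")
    return skill(observation)
```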
The objects and effects of the present invention will become more apparent by describing in detail a large language model interaction method of the present invention according to an embodiment.
Example 1
As shown in FIG. 3, a communication schematic of one embodiment of a method of applying large language model interactions in a single agent system is shown.
Specifically, taking a certain working state of the transfer robot of the intelligent factory as an example (practical application scenario includes but is not limited to the transfer robot), the transfer robot is an executor, and the large language model is a planner. The large language model interaction method applied to the single-agent system comprises the following steps:
(1) The transfer robot transmits observation data such as image data acquired by the camera to the coordinator on the way to the next object to be transferred.
(2) The coordinator receives the observation data sent by the transfer robot and uses the optimal communication strategy to judge whether communication with the planner is needed. If the image acquired by the transfer robot's camera contains no object to be transferred, or the object is still far away, the coordinator does not need to communicate with the large language model; the coordinator then resends the current high-level instruction to the transfer robot so that it continues to execute the originally planned movement. If the transfer robot is close to the object to be transferred, the coordinator needs to communicate with the planner: it converts the observation data, such as the image data collected by the robot's camera, into standard form data, i.e., natural language the large language model can understand, such as "the object has height XX and width XX, and its position relative to the transfer robot is XX," and outputs these data to the large language model.
(3) After the large language model receives the standard form data, it generates a new high-level instruction corresponding to the execution action, i.e., a high-level instruction in natural-language form such as "rotate by an angle of XX to face the object, extend both hands forward by a distance of XX at a spacing of XX and grasp the object, lift it by a height of XX, and then retract both hands," and issues the new high-level instruction to the transfer robot.
(4) After receiving the high-level instruction in natural-language form, the transfer robot invokes the pre-trained or preset underlying control logic corresponding to it, such as "rotate," "extend both hands," "lift the object," and "retract both hands," so as to carry out the instruction at the underlying mechanical-control level and complete the corresponding execution actions.
Example 2
As shown in FIG. 4, a communication diagram of one embodiment of applying the large language model interaction method in a multi-agent system is shown.
Specifically, take a certain working state of an intelligent traffic system as an example (practical application scenarios include but are not limited to intelligent traffic systems). In this system, the executors include a plurality of signal lamps, roadside monitoring devices, and autonomous vehicles, and the planner is a collaboration between a large language model and traffic police. The large language model interaction method applied to the multi-agent system comprises the following steps:
(1) Several roadside monitoring devices and on-board cameras of autonomous vehicles detect an accident on a certain road section and send the currently acquired observation data to the coordinator.
(2) The coordinator receives and aggregates the observation data sent by the multiple executors and judges whether communication with the planner is needed. If, after analyzing the observation data collected by the roadside monitoring devices and the autonomous vehicles, the coordinator judges that the accident has little influence on the traffic situation of the road section, the coordinator does not need to communicate with the planner, and all executors continue to act according to the original plan. If the coordinator judges from the observation data that the road section is blocked by the accident and the blocked traffic flow needs to be dispersed or guided to idle surrounding road sections, the coordinator needs to communicate with the planner: it converts the observation data collected by the multiple executors into standard form data, i.e., natural language the planner can understand, such as "an accident has occurred in lane XX of road section XX, and XX motor vehicles are blocked," and outputs these data to the large language model and the traffic police.
(3) After the large language model and the traffic police receive the standard form data, they generate new high-level instructions corresponding to the execution actions and send to each executor in the multi-agent system the high-level instruction corresponding to the optimal action it should perform; for example, autonomous vehicles that have not yet entered the road section are instructed to turn ahead, autonomous vehicles on the road section are instructed to avoid the accident lane ahead, the signal lamps on the road section are instructed to keep the green light longer, and so on, so that the whole multi-agent system can reach a Nash equilibrium through jointly optimal actions.
(4) After receiving its corresponding high-level instruction in natural-language form, each executor invokes the pre-trained or preset underlying control logic corresponding to it, such as "steer," "decelerate," "change lanes," or "switch the light scheme," so as to carry out the instruction at the underlying control level and complete the corresponding execution actions.
Example 3
As shown in fig. 5, a communication diagram of another embodiment of applying the large language model interaction method in a multi-agent system is shown; the system includes a plurality of executors, one coordinator, and one planner. In this embodiment, the large language model interaction method specifically comprises the following steps:
(1) The current system contains a plurality of executors. Each executor sends the observation data it has acquired to the coordinator, and the coordinator combines all the received observation data into combined observation data.
(2) The coordinator judges, according to the combined observation data and using the optimal communication strategy, whether it needs to communicate with the planner. If the coordinator does not need to communicate with the planner, it resends the current high-level instruction to the corresponding executors so that each executor continues to execute its original high-level instruction; if the coordinator needs to communicate with the planner, it converts the combined observation data into standard form data, such as observation data in text-string or picture form, and sends the standard form data to the planner.
(3) After receiving the standard form data, the planner generates a new high-level instruction corresponding to the execution action based on the standard form data and sends the new high-level instruction to the executor.
(4) After receiving the high-level instruction, the executor calls the bottom control logic corresponding to the high-level instruction according to the current observation data to execute the corresponding execution action.
Example 4
As shown in fig. 6, a communication diagram of another embodiment of applying the large language model interaction method in a multi-agent system is shown; the system includes a plurality of executors, a plurality of coordinators, a plurality of sub-planners, and a general planner. In this embodiment, the large language model interaction method specifically comprises the following steps:
(1) Each executor transmits the observation data acquired by the executor to the corresponding coordinator.
(2) The coordinator judges whether to communicate with the sub-planners or not by adopting an optimal communication strategy according to the received observation data, and if the coordinator does not need to communicate with the sub-planners, the coordinator resends the current high-level instruction to the corresponding executor so that the executor can continue to execute the original high-level instruction; if the coordinator needs to communicate with the sub-planners, the coordinator converts the observed data into standard form data, such as text strings or picture form observed data, and sends the standard form data to the sub-planners.
(3) After each sub-planner receives the standard form data, it generates a new high-level instruction corresponding to the execution action; each sub-planner then sends the generated new high-level instruction to the general planner, and the general planner, after integrating the high-level instructions sent by all the sub-planners, produces global instructions, i.e., generates the corresponding high-level instruction for each executor and sends it to that executor.
(4) After each executor receives the high-level instruction, according to the current observation data, the bottom control logic corresponding to the high-level instruction is called to execute the corresponding execution action.
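As an illustrative sketch of this embodiment (all objects and methods are hypothetical placeholders, not the claimed implementation), one step of the hierarchical interaction could be organized as follows.

```python
# Sketch of the hierarchical variant: each coordinator may query its own sub-planner,
# and a general planner merges the sub-planners' proposals into one global instruction
# per executor. All objects are illustrative placeholders.
def hierarchical_step(executors, coordinators, sub_planners, general_planner):
    proposals = {}
    for name, executor in executors.items():
        obs = executor.observe()
        coordinator = coordinators[name]
        if coordinator.should_ask(obs):
            proposals[name] = sub_planners[name].plan(coordinator.to_standard_form(obs))
        else:
            proposals[name] = coordinator.current_instruction
    # the general planner reconciles the proposals into a consistent global plan
    global_instructions = general_planner.integrate(proposals)
    for name, executor in executors.items():
        executor.execute(global_instructions[name])
```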
In summary, a task in the MiniGrid environment is used to simulate an agent in a smart-factory scenario; in this task, the agent needs to explore the whole room with a limited field of view, find the key matching the color of the door, and use that key to open the door. As shown in fig. 7, the beneficial effects of the present invention were verified by comparing four methods: the method adopted by the invention, a method that manually presets when the agent communicates, a method that always communicates with the large language model, and a method that never communicates with the large language model. The ordinate of fig. 7 (a) represents the task success rate, the ordinate of fig. 7 (b) represents the number of communications between the agent and the large language model, and the abscissa of both figures represents the number of experimental rounds. It can be seen that, with continued training, the method adopted by the invention keeps reducing the communication cost and keeps improving the task success rate, finally outperforming the other methods in both communication cost and task success rate.
It is worth mentioning that the invention also provides a large language model interaction device.
Specifically, as shown in fig. 2, the apparatus includes a planner module, a coordinator module, and an executor module. The planner module is used for generating a new high-level instruction corresponding to the execution action according to the received standard form data. The coordinator module is used for judging whether the coordinator needs to communicate with the planner or not by adopting an optimal communication strategy according to the observation data, and if the coordinator needs to communicate with the planner, the coordinator converts the observation data into standard form data and sends the standard form data to the planner; if the coordinator does not need to communicate with the planner, the coordinator resends the current high-level instructions to the executor. The executor module is used for collecting observation data, and after receiving the high-level instruction, the executor module calls the bottom control logic corresponding to the high-level instruction to execute the corresponding execution action.
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. One typical implementation is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above devices are described as being functionally divided into various units, respectively. Of course, the functions of each element may be implemented in the same piece or pieces of software and/or hardware when implementing the present invention.
Referring to fig. 8, an electronic device according to an embodiment of the present invention includes a memory and a processor, where the memory is coupled to the processor; the memory is used for storing program data, and the processor is used for executing the large language model interaction method in the embodiment.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer program(s) stored thereon that when executed by a processor performs the large language model interaction method of the above-described embodiments.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a(n) …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
The embodiments of the present invention are described in a progressive manner, and the same and similar parts of the embodiments are all referred to each other, and each embodiment is mainly described in the differences from the other embodiments. In particular, for system embodiments, since they are substantially similar to method embodiments, the description is relatively simple, as relevant to see a section of the description of method embodiments.
The above embodiments are merely for illustrating the design concept and features of the present invention, and are intended to enable those skilled in the art to understand the content of the present invention and implement the same, the scope of the present invention is not limited to the above embodiments. Therefore, all equivalent changes or modifications according to the principles and design ideas of the present invention are within the scope of the present invention.
Claims (10)
1. A large language model interaction method, comprising the steps of:
(1) After the executor interacts with the environment, it sends the currently acquired observation data to a coordinator;
(2) The coordinator judges whether the coordinator needs to communicate with the planner or not by adopting an optimal communication strategy according to the received observation data, and if the coordinator needs to communicate with the planner, the coordinator converts the observation data into standard form data and sends the standard form data to the planner; if the coordinator does not need to communicate with the planner, the coordinator resends the current high-level instruction to the executor and jumps to step (4);
(3) After receiving the standard form data, the planner generates a new high-level instruction corresponding to the execution action based on the standard form data and sends the new high-level instruction to the executor;
(4) After receiving the high-level instruction, the executor calls the bottom control logic corresponding to the high-level instruction according to the current observation data to execute the corresponding execution action.
2. The large language model interaction method of claim 1, wherein the observation data includes sensor data, text data, and image data.
3. The large language model interaction method according to claim 1, wherein the optimal communication strategy specifically comprises: the coordinator decides at each moment whether to communicate with the planner, so that the task can be completed with the minimum number of communications with the planner, and this process is defined as a reinforcement learning process, wherein the reinforcement learning process specifically comprises: for the state at each moment, the coordinator has two different actions, namely persisting with the current plan or requesting a new plan from the planner, and an invalid communication penalty is introduced on top of the reward given by the environment to form the cumulative communication reward; wherein the state is the observation data collected by the executor; and the optimal communication strategy of the coordinator is obtained by maximizing the cumulative communication reward.
4. A large language model interaction method according to claim 3, wherein the expression of the cumulative communication reward is:

$$R_t = \sum_{k=0}^{\infty} \gamma^{k}\left[\, r_{t+k} - \lambda \cdot \mathbb{1}\left(a_{t+k} = \text{ask} \ \wedge \ g_{t+k} = g_{t+k-1}\right)\right]$$

wherein $R_t$ is the cumulative communication reward at time $t$, $r_t$ is the reward given by the environment at time $t$, $\mathbb{1}(\cdot)$ is the indicator function, $a_t$ is the action of the coordinator at time $t$, ask indicates that the coordinator needs to communicate with the planner, not ask indicates that the coordinator does not need to communicate with the planner, $g_t$ is the high-level instruction returned by the planner at time $t$, $\gamma$ is the reward discount coefficient, and $\lambda$ is the invalid communication penalty coefficient.
5. The large language model interaction method of claim 3, wherein the training method for maximizing the cumulative communication reward comprises a proximal policy optimization method, a maximum-entropy actor-critic method, a deep Q-network, and an advantage actor-critic method.
6. The large language model interaction method of claim 1, wherein the executor comprises a plurality of agents; and the planner comprises a large language model, or a large language model in cooperation with a user.
7. The large language model interaction method of claim 1, wherein the high-level instructions are in one-to-one correspondence with the underlying control logic, the underlying control logic is in one-to-one correspondence with the execution actions, and the high-level instructions are in one-to-one correspondence with the execution actions.
8. A large language model interaction device for implementing the large language model interaction method of any one of claims 1 to 7, comprising:
the planner module is used for generating a new high-level instruction corresponding to the execution action according to the received standard form data;
the coordinator module is used for judging whether the coordinator needs to communicate with the planner or not by adopting an optimal communication strategy according to the observation data, and if the coordinator needs to communicate with the planner, the coordinator converts the observation data into standard form data and sends the standard form data to the planner; if the coordinator does not need to communicate with the planner, the coordinator resends the current high-level instruction to the executor; and
and the executor module is used for collecting the observation data and calling the bottom control logic corresponding to the high-level instruction to execute the corresponding execution action after receiving the high-level instruction.
9. An electronic device comprising a memory and a processor, wherein the memory is coupled to the processor; wherein the memory is for storing program data and the processor is for executing the program data to implement the large language model interaction method of any of claims 1-7.
10. A computer readable storage medium having stored thereon a computer program, wherein the program when executed by a processor implements the large language model interaction method of any of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311498497.XA CN117236416A (en) | 2023-11-13 | 2023-11-13 | Large language model interaction method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117236416A true CN117236416A (en) | 2023-12-15 |
Family
ID=89098640
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311498497.XA Pending CN117236416A (en) | 2023-11-13 | 2023-11-13 | Large language model interaction method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117236416A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118092764A (en) * | 2024-03-11 | 2024-05-28 | 北京邮电大学 | Method and device for controlling actions of intelligent agent guided by large language model |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116841615A (en) * | 2023-06-07 | 2023-10-03 | 福建天泉教育科技有限公司 | Man-machine interaction method and system based on large language model |
CN116661503A (en) * | 2023-08-02 | 2023-08-29 | 中国人民解放军96901部队 | Cluster track automatic planning method based on multi-agent safety reinforcement learning |
Non-Patent Citations (1)
Title |
---|
BIN HU 等: "Enabling Intelligent Interactions between an Agent and an LLM: A Reinforcement Learning Approach", 《ARXIV》, vol. 2023 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP7532615B2 (en) | Planning for autonomous vehicles | |
CN113805572B (en) | Method and device for motion planning | |
Tai et al. | Socially compliant navigation through raw depth inputs with generative adversarial imitation learning | |
CN111367282B (en) | Robot navigation method and system based on multimode perception and reinforcement learning | |
Li et al. | Survey on artificial intelligence for vehicles | |
US11132211B1 (en) | Neural finite state machines | |
CN109753047A (en) | System and method for autonomous vehicle behaviour control | |
Cai et al. | DQ-GAT: Towards safe and efficient autonomous driving with deep Q-learning and graph attention networks | |
CN109791409A (en) | The motion control decision of autonomous vehicle | |
CN117236416A (en) | Large language model interaction method and device | |
CN115303297B (en) | Urban scene end-to-end automatic driving control method and device based on attention mechanism and graph model reinforcement learning | |
Menéndez-Romero et al. | Courtesy behavior for highly automated vehicles on highway interchanges | |
Fernández-Isabel et al. | Modeling multi-agent systems to simulate sensor-based Smart Roads | |
KR20190098935A (en) | Artificial intelligence laundry treating apparatus | |
US20240336279A1 (en) | Method and system for expanding the operational design domain of an autonomous agent | |
Ahmed et al. | Policy-based reinforcement learning for training autonomous driving agents in urban areas with affordance learning | |
Cai et al. | Carl-lead: Lidar-based end-to-end autonomous driving with contrastive deep reinforcement learning | |
Huynh et al. | A Method of Deep Reinforcement Learning for Simulation of Autonomous Vehicle Control. | |
Yun et al. | Parallelized and randomized adversarial imitation learning for safety-critical self-driving vehicles | |
Ilievski | Wisebench: A motion planning benchmarking framework for autonomous vehicles | |
Li et al. | Stochastic pedestrian avoidance for autonomous vehicles using hybrid reinforcement learning | |
CN113485300B (en) | Automatic driving vehicle collision test method based on reinforcement learning | |
CN114859921A (en) | Automatic driving optimization method based on reinforcement learning and related equipment | |
Mahajan et al. | Intent-Aware Autonomous Driving: A Case Study on Highway Merging Scenarios | |
CN115700626A (en) | Reward function for a vehicle |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |