WO2021209023A1 - Decision-making method for agent actions and related device - Google Patents

Decision-making method for agent actions and related device

Info

Publication number
WO2021209023A1
WO2021209023A1 PCT/CN2021/087642 CN2021087642W WO2021209023A1 WO 2021209023 A1 WO2021209023 A1 WO 2021209023A1 CN 2021087642 W CN2021087642 W CN 2021087642W WO 2021209023 A1 WO2021209023 A1 WO 2021209023A1
Authority
WO
WIPO (PCT)
Prior art keywords
agent
collaboration
message
model
reward
Prior art date
Application number
PCT/CN2021/087642
Other languages
English (en)
French (fr)
Inventor
张公正
胡斌
徐晨
王坚
李榕
Original Assignee
华为技术有限公司
Priority date
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority to EP21788348.7A (published as EP4120042A4)
Publication of WO2021209023A1
Priority to US17/964,233 (published as US20230032176A1)

Classifications

    • G - PHYSICS
    • G05 - CONTROLLING; REGULATING
    • G05D - SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D 1/00 - Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D 1/02 - Control of position or course in two dimensions
    • G05D 1/021 - Control of position or course in two dimensions specially adapted to land vehicles
    • G05D 1/0212 - Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
    • G05D 1/0221 - Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory involving a learning process
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/217 - Validation; Performance evaluation; Active pattern learning techniques
    • G06F 18/2178 - Validation; Performance evaluation; Active pattern learning techniques based on feedback of a supervisor
    • G06F 18/2185 - Validation; Performance evaluation; Active pattern learning techniques based on feedback of a supervisor, the supervisor being an automated module, e.g. intelligent oracle
    • G - PHYSICS
    • G05 - CONTROLLING; REGULATING
    • G05B - CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B 13/00 - Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B 13/02 - Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion: electric
    • G05B 13/0265 - Adaptive control systems, electric, the criterion being a learning criterion
    • G05B 13/027 - Adaptive control systems, electric, the criterion being a learning criterion using neural networks only
    • G - PHYSICS
    • G05 - CONTROLLING; REGULATING
    • G05D - SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D 1/00 - Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D 1/02 - Control of position or course in two dimensions
    • G05D 1/021 - Control of position or course in two dimensions specially adapted to land vehicles
    • G05D 1/0287 - Control of position or course in two dimensions specially adapted to land vehicles involving a plurality of land vehicles, e.g. fleet or convoy travelling
    • G05D 1/0291 - Fleet control
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/004 - Artificial life, i.e. computing arrangements simulating life
    • G06N 3/006 - Artificial life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G06N 3/084 - Backpropagation, e.g. using gradient descent
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G06N 3/092 - Reinforcement learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G06N 3/0455 - Auto-encoder networks; Encoder-decoder networks
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 5/00 - Arrangements affording multiple use of the transmission path
    • H04L 5/003 - Arrangements for allocating sub-channels of the transmission path
    • H04L 5/0032 - Distributed allocation, i.e. involving a plurality of allocating devices, each making partial allocation
    • H04L 5/0035 - Resource allocation in a cooperative multipoint environment

Definitions

  • The embodiments of the present application relate to the field of communication technology, and in particular to a decision-making method for agent actions and a related device.
  • In many scenarios, multiple agents need to interact with the environment at the same time and each complete corresponding actions in order to accomplish a specified task.
  • Reinforcement learning refers to an agent learning by interacting with the environment: based on the state fed back by the environment, the agent acts on the environment, obtains rewards, and acquires knowledge according to the reward mechanism, thereby improving its response to the environment.
  • In multi-agent reinforcement learning, since it is usually difficult for a single agent to observe the global state information, the agents need to communicate with each other to exchange state information and realize multi-agent collaboration.
  • To do so, the communication content and the communication method need to be determined: the former determines what is transmitted, and the latter determines how it is transmitted.
  • In the prior art, an agent usually uses a neural network model to determine the communication content: the input of the model is the state of the environment, and the outputs are the communication information transmitted to the other agents and the agent's own action; the reward of the entire collaborative task is then used as feedback to evaluate the neural network, and the model is trained according to this evaluation.
  • Because the communication information does not directly affect the completion of the collaboration task, but only indirectly affects the task reward by influencing the actions of each agent, using the task reward to guide the learning of the communication information makes the model harder to train, and outputting the communication information and the action at the same time significantly increases the size of the network, so the difficulty of model training urgently needs to be solved. On the other hand, communication overhead is not considered in the reinforcement-learning objective, so the learned communication content may have a large dimensionality and cannot be applied to practical communication scenarios, especially wireless communication with limited bandwidth.
  • This application provides an agent action decision-making method to solve the difficulty of training a neural network model that arises when task rewards are used to indirectly determine the communication content; the influence of communication on actions and the communication dimensionality are taken as part of the learning objective, so as to address both the difficulty of learning the communication content and the communication overhead between agents.
  • The first aspect of this application provides an agent action decision-making method, including:
  • The first agent obtains first state information from the environment and processes it through a first model to obtain a first collaboration message; this message is used both for the decision of the first agent's own action and as the communication information transmitted to other agents. The first agent then inputs the first collaboration message and the second collaboration messages received from the other agents into a second model to obtain the first collaboration action that the first agent needs to perform. The first model and the second model learn according to the same reward during training; this reward is an evaluation related to the state information acquired by each agent, the collaboration messages obtained, and the finally determined collaboration action of each agent, and the multiple agents can train the first model and the second model based on this evaluation.
  • This application uses the first model to process the first state information to obtain the first collaboration message, and then uses the second model to process the first collaboration message to obtain the first collaboration action, so that the original parallel network model, which obtained the collaboration message and the collaboration action from the state information at the same time, becomes a serial network model. In addition, the reward is related not only to the task but also to the state information, the collaboration messages, and the collaboration actions. This makes the structure of each network model simpler, the new reward mechanism can directly evaluate the first model and the second model, and the training of the neural network models becomes easier.
  • this application also provides a first implementation manner of the first aspect:
  • The reward is an evaluation of the result of completing the task among the multiple agents based on the same collaborative task; it is an evaluation mechanism related to the task itself, so for a given task, the higher the degree of completion and the more fully the goal is achieved, the higher the reward. The reward mechanism may further include a first reward and a second reward.
  • The first reward is based on the correlation between the state information and the first collaboration message: the first model screens and compresses the state information, and the lower the correlation between the first state information and the first collaboration message, the more relevant the information selected by the first model, and therefore the higher the first reward.
  • The first reward can thus evaluate the first model directly, and the first model is trained according to the feedback of the first reward, so that the network model can be continuously optimized according to the reward.
  • this application also provides a second implementation manner of the first aspect:
  • The reward may also include a second reward, which is based on the correlation among the first collaboration message, the first collaboration action, and the second collaboration action.
  • The first collaboration message serves as an input of the second model of both the first agent and the other agents, guiding the selection of the agents' collaborative actions; the more relevant the first collaboration message is to the first collaboration action and the second collaboration action, the higher the second reward, and the higher the first reward and the second reward, the higher the overall reward.
  • The reward therefore evaluates not only the completion of the task but also the correlation between the first state information and the first collaboration message and the correlation between the first collaboration message and the collaboration actions. Feedback through this reward mechanism continuously optimizes the network models to obtain better first collaboration messages, first collaboration actions, and second collaboration actions: the network models extract the most useful information, that is, the first collaboration message, from the state information and then derive the best collaboration action of each agent from it, so as to complete the collaborative task better.
  • this application also provides a third implementation manner of the first aspect:
  • the state information includes the first state information obtained by the first agent from the environment, and also includes the second state information obtained by other agents from the environment, where the second state information is used to determine the second collaboration message of the other agent .
  • this application also provides a fourth implementation manner of the first aspect:
  • The collection of state information from the environment by the multiple agents is also determined by the task, and the collection tasks are allocated according to a certain evaluation mechanism: the first agent obtains the first state information from the environment according to the evaluation mechanism of the collaborative task, and the second agent obtains the second state information from the environment according to the same evaluation mechanism. Each agent then processes the state information it obtained and derives the collaboration message used for mutual communication, so that every agent can observe the overall state information and the agents can collaborate better to complete the task.
  • this application also provides a fifth implementation manner of the first aspect:
  • The first agent may also include a screening model, located between the first model and the second model, for screening the first collaboration message and the second collaboration message: before they are input to the second model, the first collaboration message and the second collaboration message are processed, including deleting redundant information and correcting erroneous information. This makes the input of the second model more accurate and more concise and reduces the training complexity of the second model.
  • this application also provides a sixth implementation manner of the first aspect:
  • Optionally, a communication module can also be used to transfer the collaboration messages: the first agent encodes the first collaboration message through the communication module and sends the encoded message to the other agents, and the other agents decode it through the communication module to obtain the first collaboration message.
  • the second aspect of the present application provides a first agent, including:
  • the processing unit is configured to process the first state information obtained from the environment through the first model to obtain the first collaboration message.
  • a sending unit configured to send the first collaboration message to at least one second agent
  • The processing unit is further configured to process the first collaboration message and the second collaboration message through a second model to obtain the first collaboration action performed by the first agent, where the second collaboration message is sent by the at least one second agent.
  • The first model and the second model are determined based on the same reward; the first collaboration message is also used to determine the second collaboration action to be performed by the at least one second agent; the reward is related to the state information, the first collaboration message, the second collaboration action, and the first collaboration action; and the state information includes the first state information.
  • this application also provides the first implementation manner of the second aspect:
  • The reward is an evaluation of the task completion degree of the first agent and the at least one second agent based on the same collaborative task, and the reward includes a first reward and/or a second reward.
  • The first reward is based on the correlation between the state information and the first collaboration message; the lower the correlation between the state information and the first collaboration message, the higher the first reward.
  • this application also provides a second implementation manner of the second aspect:
  • The second reward is based on the correlation among the first collaboration message, the first collaboration action, and the second collaboration action; the higher the correlation among the first collaboration message, the first collaboration action, and the second collaboration action, the higher the second reward.
  • this application also provides a third implementation manner of the second aspect:
  • the state information further includes second state information, and the second state information is used by the at least one second agent to obtain the second collaboration message according to the second state information.
  • this application also provides a fourth implementation manner of the second aspect:
  • The obtaining unit is configured to obtain the first state information from the environment according to the evaluation mechanism of the collaboration task; the second state information is obtained by the at least one second agent from the environment according to the same evaluation mechanism of the collaboration task.
  • this application also provides a fifth implementation manner of the second aspect:
  • the receiving unit is configured to receive the second collaboration message through a screening model, and the screening model is used to screen according to the first collaboration message and the second collaboration message.
  • this application also provides a sixth implementation manner of the second aspect:
  • The sending unit is specifically configured to encode the first collaboration message through a communication model; the sending unit sends the encoded first collaboration message to the at least one second agent, so that the at least one second agent decodes the encoded first collaboration message through the communication model to obtain the first collaboration message.
  • The third aspect of the present application provides an agent, including at least one processor and a memory, where the memory stores computer-executable instructions that can run on the processor; when the computer-executable instructions are executed by the processor, the agent executes the method described in the foregoing first aspect or any one of the possible implementation manners of the first aspect.
  • The fourth aspect of the present application provides a multi-agent collaboration system, including a first agent and at least one second agent; the first agent and the at least one second agent perform the method described in the above-mentioned first aspect or any possible implementation manner of the first aspect.
  • The fifth aspect of the embodiments of the present application provides a computer storage medium, which is used to store the computer software instructions used by the above-mentioned agent and includes a program designed for the agent to execute.
  • the agent may be the first agent described in the second aspect.
  • the sixth aspect of the present application provides a chip or chip system.
  • the chip or chip system includes at least one processor and a communication interface.
  • The communication interface and the at least one processor are interconnected by wires, and the at least one processor is used to run computer programs or instructions to perform the agent action decision-making method described in the first aspect or any one of its possible implementation manners.
  • the communication interface in the chip can be an input/output interface, a pin, or a circuit.
  • the chip or chip system described above in this application further includes at least one memory, and instructions are stored in the at least one memory.
  • The memory may be a storage unit inside the chip, for example, a register or a cache, or a storage unit located outside the chip (for example, a read-only memory, a random access memory, etc.).
  • The seventh aspect of the present application provides a computer program product; the computer program product includes computer software instructions, and the computer software instructions can be loaded by a processor to implement the procedure of the agent action decision-making method in any one of the implementation manners of the first aspect described above.
  • This application processes the state information collected from the environment to obtain the first collaboration message, and then processes the first collaboration message to obtain the first collaboration action, so that the original parallel network model, which obtained the collaboration message and the collaboration action from the state information at the same time, becomes a serial network model.
  • In addition, the reward is related not only to the task but also to the state information, the collaboration messages, and the collaboration actions. This makes the structure of each network model simpler, and the new reward mechanism can directly evaluate the first model and the second model, which reduces the training complexity of the neural network models and improves the completion of the collaborative task.
  • Figure 1 is a schematic diagram of agent reinforcement learning provided by an embodiment of this application.
  • Figure 2 is a schematic diagram of multi-agent reinforcement learning provided by an embodiment of this application.
  • FIG. 3 is a schematic diagram of the structure of a fully connected neural network provided by an embodiment of the application.
  • FIG. 4 is a network architecture diagram corresponding to the method for decision-making of an agent action provided by an embodiment of the application
  • FIG. 4A is another network architecture diagram corresponding to the method for determining an agent action provided by an embodiment of the application.
  • FIG. 4B is another network architecture diagram corresponding to the method for decision-making of an agent action provided by an embodiment of this application;
  • FIG. 5 is a schematic flowchart of a method for decision-making of an agent action provided by an embodiment of the application
  • FIG. 6 is a training framework diagram corresponding to the first model and the second model provided by an embodiment of the application.
  • FIG. 7 is another training framework diagram corresponding to the first model and the second model provided by an embodiment of the application.
  • FIG. 8 is a schematic structural diagram of an agent provided by an embodiment of the application.
  • Fig. 9 is a schematic structural diagram of another agent provided by an embodiment of the application.
  • This application provides an agent action decision-making method to solve the problem of difficulty in training a neural network model caused by using task rewards to indirectly determine the content of communication.
  • FIG. 1 is a schematic diagram of the reinforcement learning of the agent provided by the embodiment of this application.
  • Reinforcement learning is the way in which an agent learns by interacting with the environment: the agent receives the state information fed back by the environment, performs an action, and then obtains a reward and the state at the next moment.
  • The reward is the evaluation, defined by the task itself, of the action that was performed: the greater the cumulative reward, the better the actions the agent performs on the environment. The agent continuously learns and adjusts its actions, finally acquiring knowledge from the rewards and improving its action plan to adapt to the environment.
  • Multi-agent reinforcement learning refers to multiple agents interacting with the environment at the same time, acting on the state of the environment, and completing a collaborative task together, for example joint scheduling of base stations, joint formation of multiple vehicles in autonomous driving, or joint multi-user device-to-device transmission. Since it is difficult for a single agent to observe the global state information, multi-agent joint learning is required to realize collaboration between the agents and complete tasks better, and joint learning requires the agents to exchange information through communication.
  • Communication between multiple agents needs to solve the problems of what to transmit (the transmission content) and how to transmit it (the transmission method).
  • In the traditional approach, the transmission content and the transmission method are designed separately.
  • the transmission content is usually all the states observed by an agent, such as images or videos collected by the camera of an autonomous car, the channel state of users in the cell collected by the base station, and data collected by various sensors.
  • the agent needs to transmit these data to other agents, and the communication method uses Shannon's communication framework, including source and channel coding. Among them, source coding realizes compression of the source to reduce communication overhead; channel coding increases redundancy to combat interference in the communication medium.
  • This communication method does not filter the communication content according to the task or according to the state information already observed by the receiver itself.
  • In the worst case, the sender needs to transmit all the state information it observes to all other agents, and the receiver must receive all the state information observed by all other agents, so that each agent can observe the global state and take the best action; the communication content therefore contains a large amount of redundant information, which reduces the communication efficiency.
  • Alternatively, the multiple agents can choose a learning-based communication method: guided by the rewards in reinforcement learning, each agent autonomously learns the communication content needed to complete the task, a neural network model is used to select the communication content, and the neural network model is trained under the guidance of the task reward.
  • Take a multilayer perceptron (MLP) as an example of a fully connected neural network, as shown in FIG. 3: the layer on the left, with inputs x_i, is the input layer; the layer on the right, with outputs y_i, is the output layer; the layers in between are hidden layers. Each layer includes several nodes called neurons, and the neurons of two adjacent layers are connected in pairs.
  • A neural network model can be understood as a mapping from an input data set to an output data set. The neural network model is usually initialized randomly, and training it is the process of continuously optimizing the weight matrices w and the bias vectors b. Specifically, a loss function is used to evaluate the output of the neural network, the error is back-propagated, and w and b are iteratively optimized by gradient descent to reduce the loss function; the trained and optimized w and b are then used to process the input and obtain the output. The loss function is related to the task and is an evaluation of the task.
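  • The following is a minimal PyTorch sketch (not part of the patent; layer sizes, data, and the loss function are assumptions chosen for illustration) of this training procedure: a small fully connected network whose weight matrices w and bias vectors b are iteratively optimized by backpropagation and gradient descent against a task-specific loss function.

```python
import torch
import torch.nn as nn

# A small multilayer perceptron: input layer -> hidden layer -> output layer.
# Each nn.Linear layer holds a weight matrix w and a bias vector b.
mlp = nn.Sequential(
    nn.Linear(8, 16),   # hidden layer (8 inputs x_i -> 16 neurons)
    nn.ReLU(),
    nn.Linear(16, 4),   # output layer (4 outputs y_i)
)

loss_fn = nn.MSELoss()                                  # task-related loss function
optimizer = torch.optim.SGD(mlp.parameters(), lr=1e-2)  # gradient descent on w and b

x = torch.randn(32, 8)  # dummy inputs (stand-in for observed data)
y = torch.randn(32, 4)  # dummy targets (stand-in for the desired outputs)

for step in range(100):
    optimizer.zero_grad()
    loss = loss_fn(mlp(x), y)  # evaluate the output with the loss function
    loss.backward()            # back-propagate the error
    optimizer.step()           # update w and b to reduce the loss
```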
  • In the prior art, the multiple agents use neural network models whose input is the state information obtained from the environment and whose outputs are their own collaborative actions and the collaboration messages transmitted to the others; the reward is used to evaluate these outputs and to train the neural network models. The goal of reinforcement learning is
  • max_π E[ Σ_t γ^t r_t ]
  • where r_t is the reward, γ is the discount factor, and π is the policy, which includes both the policy on the collaborative action and the policy on the collaboration message: π(a_i | s_i, m_-i) means that agent i performs action a_i when its state is s_i and the messages received from the other agents are m_-i, and π(m_i | s_i) means that agent i sends collaboration message m_i when its state is s_i.
  • Because the reward evaluates the collaboration message only indirectly, it is very difficult to learn the content of the collaboration message based on the reward.
  • In addition, outputting collaboration messages and collaboration actions in parallel from the state information leads to a significant increase in the size of the neural network model, making training extremely difficult; on the other hand, the goal of reinforcement learning is only the reward, without considering the communication overhead, which may cause the learned communication content to have a large dimensionality that cannot be applied to practical communication scenarios.
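  • For contrast with the serial structure introduced below, the following hedged sketch (dimensions and network shapes are assumptions, not taken from the patent) shows the prior-art parallel arrangement, in which a single network maps the observed state directly to both the outgoing collaboration message and the agent's own action, and is judged only by the discounted task return Σ_t γ^t r_t.

```python
import torch
import torch.nn as nn

class ParallelAgent(nn.Module):
    """Prior-art style: one network outputs the message and the action in parallel."""
    def __init__(self, state_dim=16, msg_dim=8, action_dim=4):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU())
        self.msg_head = nn.Linear(64, msg_dim)        # communication content m_i
        self.action_head = nn.Linear(64, action_dim)  # own action a_i

    def forward(self, state):
        h = self.trunk(state)
        return self.msg_head(h), self.action_head(h)

def discounted_return(rewards, gamma=0.99):
    """The reinforcement-learning objective: sum over t of gamma^t * r_t."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

agent = ParallelAgent()
message, action = agent(torch.randn(1, 16))  # both outputs come from one large model
```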
  • FIG. 4 is a network architecture diagram corresponding to the agent action decision method in the embodiment of this application.
  • As shown in FIG. 4, the first agent obtains the first state information s1 from the environment and inputs s1 into the first model of the first agent to generate the first collaboration message m1; the second agent obtains the second state information s2 from the environment and inputs s2 into the first model of the second agent to generate the second collaboration message m2.
  • The first agent then receives the second collaboration message m2 sent by the second agent and uses the first collaboration message m1 and the second collaboration message m2 as the input of the second model of the first agent, which processes them to generate the first collaboration action a1; in the same way, the second agent receives the first collaboration message m1 sent by the first agent and uses m1 and m2 as the input of the second model of the second agent, which processes them to generate the second collaboration action a2.
  • This embodiment adopts a serial network structure: the first model extracts from the state information the information that is useful for the collaborative actions of the multiple agents according to the collaboration task and generates the collaboration message; this collaboration message is used both for the current agent's action decision and as the communication content transmitted to the other agents, where it influences their collaborative action decisions.
  • The second model takes as input the collaboration message generated by the agent's own first model and the collaboration messages sent by the other agents, and outputs the collaboration action that the agent needs to perform. In this way, the structure of each neural network model is simpler, and the first model and the second model can be trained directly by changing the reward mechanism.
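  • The following is a minimal sketch of the serial structure of FIG. 4, with assumed dimensions and plain fully connected layers standing in for the first and second models: each agent's first model turns its local state into a collaboration message, the messages are exchanged, and each agent's second model maps its own message together with the received message to a collaboration action.

```python
import torch
import torch.nn as nn

class FirstModel(nn.Module):
    """Extracts a collaboration message from the locally observed state information."""
    def __init__(self, state_dim=16, msg_dim=8):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, msg_dim))

    def forward(self, state):
        return self.net(state)

class SecondModel(nn.Module):
    """Maps the own message plus the received message(s) to a collaboration action."""
    def __init__(self, msg_dim=8, n_agents=2, action_dim=4):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(msg_dim * n_agents, 64), nn.ReLU(), nn.Linear(64, action_dim))

    def forward(self, own_msg, other_msg):
        return self.net(torch.cat([own_msg, other_msg], dim=-1))

# Two agents, each with its own first and second model.
f1, f2, g1, g2 = FirstModel(), FirstModel(), SecondModel(), SecondModel()

s1, s2 = torch.randn(1, 16), torch.randn(1, 16)  # first / second state information
m1, m2 = f1(s1), f2(s2)                          # first / second collaboration message
a1 = g1(m1, m2)                                  # first collaboration action
a2 = g2(m2, m1)                                  # second collaboration action
```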
  • the specific method steps are shown in Figure 5:
  • the first agent processes the first state information obtained from the environment through the first model to obtain the first collaboration message.
  • The state information is determined by the environmental characteristics that need to be observed for the collaborative task. For example, in autonomous driving the state information can be the road conditions observed by the vehicle or the obstacle images it captures; in device-to-device communication the state information may be the channel resources and channel loss values obtained by the device.
  • The agent determines the action plan for completing the collaborative task in order to adapt to the environment, so it first needs to obtain state information from the environment, and each agent has its own task of observing the state of the environment. Optionally, the multiple agents allocate the acquisition of state information according to the collaboration task itself: the first agent obtains the first state information from the environment according to a certain evaluation mechanism of the collaborative task, and the second agent obtains the second state information from the environment according to the same evaluation mechanism.
  • For example, when multiple vehicles drive in formation, the evaluation mechanism can include an evaluation of the distance between the vehicles; the leading vehicle may therefore obtain road condition information from the environment, while the following vehicles not only monitor the road conditions but also monitor the position of the vehicle in front. Each vehicle may also need to obtain traffic-jam information from the environment, perceive the dynamics of approaching vehicles and pedestrians, obtain traffic-signal information, and perceive the distance between vehicles; what each vehicle needs to monitor is determined according to the task, and a reasonable state-information acquisition strategy is designed so that resources are used rationally and the task is completed properly.
  • The first agent monitors the environment according to the evaluation mechanism and obtains useful first state information; it then processes the first state information, that is, it learns through the first model the information in the first state information that is useful for the task, and obtains the first collaboration message.
  • Optionally, the structure and dimensions of the first model are determined by the data structure of the first state information: if the state information is an image, the first model can use a convolutional network; if the state information is a channel value, the first model can use a fully connected neural network; the specific form is not limited, as illustrated by the sketch following this passage.
  • The first collaboration message is not only used for the action decision of the first agent but also serves as the communication information exchanged between the multiple agents, so that the other agents can obtain global environment state information; that is, it is also used for the action decision of the second agent.
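  • As a hedged illustration of the point that the first model's structure follows the data structure of the state information (the patent leaves the specific form open), one could pick a convolutional encoder for image-like states and a fully connected one for vector-valued states such as channel values; the sizes below are placeholders.

```python
import torch.nn as nn

def build_first_model(state_shape, msg_dim=8):
    """Choose a first-model architecture based on the shape of the state information."""
    if len(state_shape) == 3:  # (channels, height, width): image-like state
        c, _, _ = state_shape
        return nn.Sequential(
            nn.Conv2d(c, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(msg_dim),  # infers the flattened size on first use
        )
    # One-dimensional state, e.g. channel resources or channel loss values.
    return nn.Sequential(nn.Linear(state_shape[0], 64), nn.ReLU(), nn.Linear(64, msg_dim))

conv_model = build_first_model((3, 64, 64))  # e.g. obstacle images observed by a vehicle
fc_model = build_first_model((10,))          # e.g. a vector of channel values
```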
  • the first agent sends a first collaboration message to at least one second agent.
  • Because the second collaboration action of the second agent is determined using the first collaboration message, the first agent needs to send the first collaboration message to the second agent; optionally, the first agent and the second agent can transmit the information through a communication model.
  • FIG. 4A is another network architecture diagram corresponding to the agent action decision-making method in the embodiment of this application. As shown in FIG. 4A, when the first agent transmits the first collaboration message m1 to the second agent, m1 is first passed to the communication model of the first agent, which then transmits it through the channel to the communication model of the second agent.
  • The communication model can include an encoder and a decoder to deal with problems such as channel interference and noise: the encoder in the communication model of the first agent first encodes the first collaboration message m1 and then sends it to the second agent; the communication module of the second agent receives the encoded first collaboration message m1, and the decoder in that communication module decodes it, completing the data transmission.
  • The communication module is used to reduce communication overhead and to ensure that the data transmission is reliable. Because it is a neural network, it can continuously learn new knowledge and adjust the coding scheme; it can be trained jointly with the first model and the second model, or it can be trained alone as an independent network that completes the communication task.
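  • The sketch below (assumed sizes and a simple additive-noise channel, i.e. one possible realization rather than the patent's specific design) shows how a communication model with an encoder at the sending agent and a decoder at the receiving agent could carry the collaboration message.

```python
import torch
import torch.nn as nn

class CommModel(nn.Module):
    """Encoder/decoder pair that carries a collaboration message over a noisy channel."""
    def __init__(self, msg_dim=8, code_dim=4):
        super().__init__()
        self.encoder = nn.Linear(msg_dim, code_dim)  # compress the message for transmission
        self.decoder = nn.Linear(code_dim, msg_dim)  # recover the message at the receiver

    def forward(self, msg, noise_std=0.1):
        code = self.encoder(msg)                              # the first agent encodes m1
        received = code + noise_std * torch.randn_like(code)  # channel interference / noise
        return self.decoder(received)                         # the second agent decodes

comm = CommModel()
m1 = torch.randn(1, 8)
m1_hat = comm(m1)  # the second agent's estimate of the first collaboration message
```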
  • the first agent receives a second collaboration message sent by at least one second agent.
  • FIG. 4B is another network architecture diagram corresponding to the agent action decision-making method in this embodiment of the application. As shown in FIG. 4B, the first agent receives the second collaboration message sent by at least one second agent, and the first collaboration message and the second collaboration message then need to be processed through the second model.
  • Optionally, the first collaboration message and the second collaboration message can first be processed through a screening module: the collaboration messages are screened, and the screened information is input to the second model.
  • For example, the screening model can delete information that is repeated between the first collaboration message and the second collaboration message, and it can compare the first collaboration message and the second collaboration message to correct erroneous information; this is not specifically limited.
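  • One hedged way to realize such a screening step (the gating layer and dimensions are assumptions; rule-based deduplication or error correction would serve equally well) is a small learned gate that decides, element by element, how much of each collaboration message to pass on to the second model.

```python
import torch
import torch.nn as nn

class ScreeningModel(nn.Module):
    """Filters the first and second collaboration messages before the second model."""
    def __init__(self, msg_dim=8):
        super().__init__()
        # The gate can suppress entries that are redundant or inconsistent between messages.
        self.gate = nn.Sequential(nn.Linear(2 * msg_dim, 2 * msg_dim), nn.Sigmoid())

    def forward(self, m1, m2):
        both = torch.cat([m1, m2], dim=-1)
        return both * self.gate(both)  # screened messages, passed on to the second model

screen = ScreeningModel()
m1, m2 = torch.randn(1, 8), torch.randn(1, 8)
screened = screen(m1, m2)  # becomes the input of the second model
```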
  • the first agent may send the first collaboration message first and then receive the second collaboration message, or it may first receive the second collaboration message and then send the first collaboration message.
  • the specific form is not limited.
  • the first agent processes the first collaboration message and the second collaboration message through the second model to obtain the first collaboration action performed by the first agent.
  • The second model is used for action decision-making: the agent determines the collaborative action it needs to perform, obtains rewards according to the task evaluation mechanism, learns new knowledge from the rewards, and continuously adjusts the collaborative action; the action plan of the agent is finally determined by the maximum cumulative reward.
  • The first model and the second model are determined by the same reward; that is, the reinforcement-learning goals of the first model and the second model are the same, and both can be of the form
  • max_π E[ Σ_t γ^t r_t ] + I(M; A) - I(S; M)
  • where I(M; A) is the mutual information between the collaboration messages and the collaboration actions; maximizing this term extracts from the state information the collaboration information that is most relevant to the actions. I(S; M) is the mutual information between the collaboration messages and the state information; minimizing this term compresses the acquired state information and eliminates the acquisition of state information that is not related to the collaboration actions.
  • Optionally, the reinforcement-learning goal of the first model and the second model may include only the task reward and the mutual information I(M; A) between the collaboration messages and the collaboration actions, or only the task reward and the mutual information I(S; M) between the collaboration messages and the state information, or it may include all three terms; this is not specifically limited.
  • The reward is related to the state information, the first collaboration message, the second collaboration action, and the first collaboration action; it is an overall evaluation of the state information, the collaboration messages, and the collaboration actions. That is, when the agent determines a collaboration action, the system needs to evaluate the correlation between the state information obtained by the agent and the first collaboration message, as well as the correlation between the first collaboration message and the first and second collaboration actions, in order to determine the final reward.
  • The first model and the second model continue to learn and determine multiple action plans; each action plan covers the acquisition of state information, the generation of the first collaboration message, and the determination of the first and second collaboration actions.
  • Each action plan corresponds to a reward, and the action plan with the largest reward determines the most effective observation of state information, the most useful collaboration message, and the most appropriate collaborative actions.
  • The reward is an evaluation of the task completion degree of the first agent and the at least one second agent based on the same collaboration task.
  • The role of the first model is to extract from the state information the information that is most useful for the task and to generate the first collaboration message; that is, the first model screens and compresses the state information, and when the first model extracts the most effective information, the correlation of the resulting message with the state information is correspondingly lower. For example, if the state information contains ten different kinds of information data and the first model screens out only the three items that are most relevant to the task, obtaining the first collaboration message from these three items, then the first collaboration message has a low correlation with the state information and its reward is high. The second model is used to obtain the most useful action from the collaboration messages: the higher the correlation between the first collaboration message and the first and second collaboration actions, the higher the reward.
  • In this way, the first collaboration message is obtained by processing the state information collected from the environment, and the first collaboration action is obtained by processing the first collaboration message, so that the original parallel network model, which obtained the collaboration message and the collaboration action from the state information at the same time, becomes a serial network model. The reward is related not only to the task but also to the state information, the collaboration messages, and the collaboration actions; this makes the structure of each network model simpler, and the new reward mechanism can directly evaluate the first model and the second model, which reduces the training complexity of the neural network models and improves the completion of the collaborative task.
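  • One way to realize a reward of this shape in practice (a sketch under common assumptions rather than the patent's prescribed estimator, with hypothetical weighting coefficients beta and lam) is to treat the first model as a stochastic encoder and to combine the task return with a variational upper bound on I(S; M), namely the KL divergence of the message distribution from a standard normal prior as in the information bottleneck, and an InfoNCE-style lower bound on I(M; A).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

msg_dim, action_dim = 8, 4
critic = nn.Bilinear(msg_dim, action_dim, 1)  # scores (message, action) pairs

def info_nce_lower_bound(m, a):
    """InfoNCE lower bound on I(M; A) over a batch of matched (message, action) pairs."""
    n = m.size(0)
    # scores[i, j] = critic(m_i, a_j): every message scored against every action in the batch.
    scores = critic(m.unsqueeze(1).expand(n, n, -1).reshape(n * n, -1),
                    a.unsqueeze(0).expand(n, n, -1).reshape(n * n, -1)).view(n, n)
    return torch.log(torch.tensor(float(n))) - F.cross_entropy(scores, torch.arange(n))

def kl_upper_bound(mu, log_var):
    """Variational upper bound on I(S; M) when m ~ N(mu(s), sigma(s)^2) with prior N(0, I)."""
    return 0.5 * (mu.pow(2) + log_var.exp() - log_var - 1).sum(dim=-1).mean()

beta, lam = 0.1, 0.01            # hypothetical weights of the two mutual-information terms
task_return = torch.tensor(1.0)  # stand-in for the discounted task reward
mu, log_var = torch.randn(32, msg_dim), torch.randn(32, msg_dim)
m = mu + (0.5 * log_var).exp() * torch.randn_like(mu)  # sampled collaboration messages
a = torch.randn(32, action_dim)                        # the corresponding collaboration actions

objective = task_return + beta * info_nce_lower_bound(m, a) - lam * kl_upper_bound(mu, log_var)
loss = -objective  # maximize the objective by minimizing its negative
```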
  • FIG. 6 is a training framework diagram corresponding to the first model and the second model in this embodiment of the application, as shown in FIG. 6:
  • The training framework includes the first agent, the second agent, a shared communication evaluation network and a shared action evaluation network, as well as the first model and the second model of each agent. The training process is as follows:
  • The first agent obtains the first state information s1 from the environment and the second agent obtains the second state information s2 from the environment; then, guided by the shared communication evaluation network, s1 is input into the first model of the first agent and s2 into the first model of the second agent, yielding the first collaboration message m1 and the second collaboration message m2 respectively.
  • The collaboration messages m1 and m2 are transmitted to the shared action evaluation network, which passes information to the second model of the first agent and the second model of the second agent; the two second models process m1 and m2 and obtain the first collaboration action a1 and the second collaboration action a2 respectively, after which rewards are obtained according to a1 and a2 and the state information at the next moment is obtained at the same time.
  • This cycle repeats; when the number of cycles reaches the number of iterations, the total reward is calculated and the trained model is saved, and then the next training run is performed. Finally, the best model is selected from the multiple trained models to determine the best parameters.
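  • The self-contained sketch below mirrors the shape of this training loop with toy linear models, a placeholder environment, and assumed dimensions; it is illustrative only and omits the shared communication and action evaluation networks.

```python
import torch
import torch.nn as nn

first_1 = nn.Linear(16, 8)   # first model of the first agent: s1 -> m1
first_2 = nn.Linear(16, 8)   # first model of the second agent: s2 -> m2
second_1 = nn.Linear(16, 4)  # second model of the first agent: (m1, m2) -> a1
second_2 = nn.Linear(16, 4)  # second model of the second agent: (m1, m2) -> a2

params = (list(first_1.parameters()) + list(first_2.parameters()) +
          list(second_1.parameters()) + list(second_2.parameters()))
optimizer = torch.optim.Adam(params, lr=1e-3)

def env_step(a1, a2):
    """Placeholder environment: a toy cooperative reward plus random next-moment states."""
    reward = -(a1 - a2).pow(2).mean()  # the agents are rewarded for agreeing
    return reward, torch.randn(1, 16), torch.randn(1, 16)

num_runs, num_cycles = 10, 20
for run in range(num_runs):
    s1, s2 = torch.randn(1, 16), torch.randn(1, 16)
    total_reward = torch.tensor(0.0)
    for cycle in range(num_cycles):             # loop until the number of cycles is reached
        m1, m2 = first_1(s1), first_2(s2)       # collaboration messages
        a1 = second_1(torch.cat([m1, m2], -1))  # first collaboration action
        a2 = second_2(torch.cat([m1, m2], -1))  # second collaboration action
        reward, s1, s2 = env_step(a1, a2)       # reward and next-moment state information
        total_reward = total_reward + reward
    optimizer.zero_grad()
    (-total_reward).backward()                  # update all models from the shared reward
    optimizer.step()
    # A full implementation would save the trained models here and keep the best one.
```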
  • FIG. 7 is another training framework diagram corresponding to the first model and the second model in this embodiment of the application, as shown in FIG. 7:
  • The training framework includes the first agent, the second agent, and the first model and the second model of each agent. The training process is as follows:
  • The first agent obtains the first state information s1 from the environment and the second agent obtains the second state information s2 from the environment; then s1 is input into the first model of the first agent and s2 into the first model of the second agent to obtain the first collaboration message m1 and the second collaboration message m2.
  • The first agent transmits the first collaboration message m1 to its own second model and to the second agent, and the second agent transmits the second collaboration message m2 to its own second model and to the first agent; the second model of the first agent processes m1 and m2 to obtain the first collaboration action a1, and the second model of the second agent processes m1 and m2 to obtain the second collaboration action a2, after which rewards are obtained based on a1 and a2 and the state information at the next moment is obtained.
  • This cycle repeats; when the number of cycles reaches the number of iterations, the total reward is calculated and the trained model is saved, and then the next training run is performed; finally, the best model is selected from the multiple trained models to determine the best parameters.
  • FIG. 8 is a schematic structural diagram of a first agent provided by an embodiment of the present application.
  • the first agent 800 includes:
  • the processing unit 801 is configured to process the first state information obtained from the environment through the first model to obtain the first collaboration message.
  • the sending unit 802 is configured to send the first collaboration message to at least one second agent.
  • The processing unit 801 is further configured to process the first collaboration message and the second collaboration message through a second model to obtain the first collaboration action performed by the first agent, where the second collaboration message is sent by the at least one second agent.
  • The first model and the second model are determined based on the same reward; the first collaboration message is also used to determine the second collaboration action to be performed by the at least one second agent; the reward is related to the state information, the first collaboration message, the second collaboration action, and the first collaboration action; and the state information includes the first state information.
  • Optionally, the reward is an evaluation of the task completion degree of the first agent 800 and the at least one second agent based on the same collaborative task, and the reward includes a first reward and/or a second reward; the first reward is based on the correlation between the state information and the first collaboration message, and the lower the correlation between the state information and the first collaboration message, the higher the first reward.
  • Optionally, the second reward is based on the correlation among the first collaboration message, the first collaboration action, and the second collaboration action; the higher this correlation, the higher the second reward.
  • the state information further includes second state information, and the second state information is used by the at least one second agent to obtain the second collaboration message according to the second state information.
  • the first agent 800 further includes an acquiring unit 803.
  • The obtaining unit 803 is configured to obtain the first state information from the environment according to the evaluation mechanism of the collaboration task; the second state information is obtained by the at least one second agent from the environment according to the same evaluation mechanism of the collaboration task.
  • the first agent further includes a receiving unit 804.
  • the receiving unit 804 is configured to receive the second collaboration message through a screening model, and the screening model is configured to screen the second collaboration message according to the first collaboration message.
  • The sending unit 802 is specifically configured to encode the first collaboration message through a communication model; the sending unit 802 sends the encoded first collaboration message to the at least one second agent, so that the at least one second agent decodes the encoded first collaboration message through the communication model to obtain the first collaboration message.
  • For the functions of each unit of the first agent 800, refer to the implementation details of the first agent in the method embodiment shown in FIG. 5, which are not repeated here.
  • the first agent 900 includes: a central processing unit 901, a memory 902, and a communication interface 903.
  • the central processing unit 901, the memory 902, and the communication interface 903 are connected to each other via a bus; the bus may be a peripheral component interconnect (PCI) bus or an extended industry standard architecture (EISA) bus, etc. .
  • the bus can be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, only one thick line is used in FIG. 9, but it does not mean that there is only one bus or one type of bus.
  • The memory 902 may include a volatile memory, such as a random-access memory (RAM); the memory may also include a non-volatile memory, such as a flash memory, a hard disk drive (HDD) or a solid-state drive (SSD); the memory 902 may also include a combination of the above types of memory.
  • The central processing unit 901 may be a central processing unit (CPU), a network processor (NP), or a combination of a CPU and an NP.
  • the central processing unit 901 may further include a hardware chip.
  • the above-mentioned hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD) or a combination thereof.
  • the above-mentioned PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), a generic array logic (GAL), or any combination thereof.
  • the communication interface 903 may be a wired communication interface, a wireless communication interface, or a combination thereof, where the wired communication interface may be, for example, an Ethernet interface.
  • the Ethernet interface can be an optical interface, an electrical interface, or a combination thereof.
  • the wireless communication interface may be a WLAN interface, a cellular network communication interface, or a combination thereof.
  • the memory 902 may also be used to store program instructions.
  • The central processing unit 901 calls the program instructions stored in the memory 902 and can execute one or more steps of the method embodiment shown in FIG. 5, or optional implementation manners thereof, so that the first agent 900 implements the functions of the agent in the foregoing method; details are not described here again.
  • The embodiment of the present application also provides a multi-agent collaboration system, including a first agent and at least one second agent; the first agent and the at least one second agent perform the agent action decision-making method in any one of the embodiments shown in FIG. 5 above.
  • the embodiment of the present application also provides a chip or chip system.
  • the chip or chip system includes at least one processor and a communication interface.
  • The communication interface and the at least one processor are interconnected through a wire, and the at least one processor is used to run a computer program or instructions to perform one or more steps of the method embodiment shown in FIG. 5, or optional implementation manners thereof, so as to realize the functions of the agent in the above method.
  • the communication interface in the chip can be an input/output interface, a pin, or a circuit.
  • the chip or chip system described above further includes at least one memory, and instructions are stored in the at least one memory.
  • The memory may be a storage unit inside the chip, for example, a register or a cache, or a storage unit located outside the chip (for example, a read-only memory, a random access memory, etc.).
  • the embodiment of the present application also provides a computer storage medium, and the computer storage medium stores computer program instructions for realizing the agent function in the agent action decision method provided in the embodiment of the present application.
  • the embodiment of the present application also provides a computer program product.
  • the computer program product includes computer software instructions, and the computer software instructions can be loaded by a processor to implement the process in the method for determining the action of the agent shown in FIG. 5.
  • the computer program product includes one or more computer instructions.
  • the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable devices.
  • the computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium.
  • The computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner (for example, via coaxial cable, optical fiber, or digital subscriber line (DSL)) or a wireless manner (for example, via infrared, radio, or microwave).
  • The computer-readable storage medium may be any available medium accessible by a computer, or a data storage device such as a server or data center integrating one or more available media.
  • the usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, and a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state disk (SSD)).
  • the disclosed system, device, and method can be implemented in other ways.
  • the device embodiments described above are merely illustrative.
  • The division of the units is only a logical function division, and there may be other divisions in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit can be implemented in the form of hardware or software functional unit.
  • if the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium.
  • the technical solutions of the present application essentially, or the part contributing to the existing technology, or all or part of the technical solutions, may be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for enabling a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present application.
  • the aforementioned storage media include any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Automation & Control Theory (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Multi Processors (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

A decision-making method for an agent action and a related device, used in the field of communication technologies. The method includes: a first agent processes, by using a first model, first state information obtained from an environment, to obtain a first collaboration message (501); the first agent sends the first collaboration message to at least one second agent (502); the first agent receives a second collaboration message sent by the at least one second agent (503); and the first agent processes the first collaboration message and the second collaboration message by using a second model, to obtain a first collaboration action performed by the first agent (504), where the second collaboration message is sent by the at least one second agent. In this way, the original parallel network model, which obtains the collaboration message and the collaboration action from the state information at the same time, becomes a serial network model, so that the structure of each network model is simpler and the neural network models are easier to train.

Description

智能体动作的决策方法及相关设备
本申请要求于2020年4月17日提交中国专利局、申请号为202010306503.7、发明名称为“智能体动作的决策方法及相关设备”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请实施例涉及通信技术领域,尤其涉及一种智能体动作的决策方法及相关设备。
背景技术
随着通信技术的不断发展,多个智能体相互协作,共同完成一项任务的场景越来越多,多智能体需要同时与环境进行交互,各自完成相适应的动作以此完成规定的任务,如多基站的联合调度、自动驾驶中多汽车的联合编排等;多智能体强化学习的需要也越来越多,强化学习即指智能体以与环境交互的方式进行学习,根据环境反馈的状态对环境作出动作,从而获得奖励并根据奖励机制来获得知识,从而改进智能体对环境作出的反应。
在多智能体强化学习中,由于单个智能体通常很难观测到全局的状态信息,因此各智能体之间需要通信来交换状态信息,来实现多智能的协作,多智能体之间相互通信就需要确定通信内容和通信方法,前者解决传输的内容,后者解决传输的途径;现有技术中,智能体通常采用神经网络模型来确定通信内容,该模型的输入为环境中的状态,输出传给其他网络的通信信息和自身的执行动作,然后根据整个协作任务的奖励来作为对神经网络的反馈评价,根据该评价来训练神经网络模型。
由于通信信息不直接影响协作任务的完成,而是通过影响每个智能体的动作来间接影响协作任务对应的奖励,这样利用任务奖励来引导通信信息的学习会使得训练模型更加困难,并且同时输出信息和动作会使得网络规模显著增大,模型训练困难的问题亟需解决;另一方面强化学习的目标中没有考虑通信开销,这可能会导致学习到的通信内容维度较大,无法适用于实际通信场景,特别是带宽受限的无线通信。
发明内容
本申请提供了一种智能体动作的决策方法,用于解决利用任务奖励来间接确定通信内容而导致的神经网络模型训练困难的问题;将通信对动作的影响和通信维度作为学习目标的一部分,从而解决通信难以学习和智能体之间通信开销的问题。
本申请的第一方面提供了一种智能体动作的决策方法,包括:
在多智能体协作的场景下,第一智能体从环境中获取第一状态信息,并通过第一模型对获取到的第一状态信息进行处理,得到第一协作消息,该第一协作消息不仅用于第一智能体自身动作的决策,还是向其他智能体传输的通信信息;然后第一智能体将第一协作消息和从其他智能体接收到的第二协作消息输入至第二模型,得到第一智能体所需要执行的第一协作动作;其中,第一模型和第二模型在训练过程中,都根据相同的任务奖励来进行学习,该奖励与每一个智能体获取到的状态信息、得到的协作消息以及最终确定的每个智能体对应的协作动作相关,是对其进行的评价,多个智能体可以根据该评价来训练第一模型和第二模型。
本申请利用第一模型对第一状态信息进行处理,得到第一协作消息,然后再利用第二模型,对第一协作消息进行处理得到第一协作动作,使得原来根据状态信息同时得到协作 消息和协作动作的并行网络模型变成串行网络模型,同时奖励除了与任务相关,也与状态信息、协作消息以及协作动作相关,这样使得每一个网络模型的结构更加简单,同时新的奖励机制可以直接对第一模型和第二模型进行评价,神经网络模型的训练也更加容易。
基于第一方面,本申请还提供了第一方面的第一种实施方式:
奖励是对基于同一协作任务的多个智能体之间,完成该任务的结果进行的评价,是与任务本身相关的一种评价机制,所以针对某一项任务,任务完成度越高,目标实现度越高,奖励越高,同时奖励机制还需要包括第一奖励和第二奖励,第一奖励为状态信息和第一协作消息的相关度,其中,第一协作消息是第一模型对第一状态信息的筛选与压缩,第一状态信息与第一协作信息相关度越低,则说明第一模型选择出最相关的信息,筛除了第一状态信息中大量与任务无关的信息,即第一状态信息与第一协作消息的相关度越低,第一奖励越高。
其中,第一奖励可以对第一模型进行评价,根据反馈的第一奖励来对第一模型进行训练学习,这样可以直接对第一模型进行评价,根据奖励不断优化网络模型。
基于第一方面的第一种实施方式,本申请还提供了第一方面的第二种实施方式:
奖励还可以包括第二奖励,第二奖励为第一协作消息、第一协作动作和第二协作动作的相关度。第一协作信息作为第一智能体和其他智能体的第二模型的输入,指导智能体协作动作的选择,当第一协作消息、第一协作动作和第二协作动作的相关度越高,第二奖励越高,第一奖励和第二奖励越高,奖励就越高。
在本申请中,奖励不仅要对任务完成度进行评价,还需要评价第一状态信息和第一协作消息的相关度,第一协作消息与协作动作的相关度;通过该种奖励机制的反馈不断优化网络模型,可以获得更优的第一协作消息、第一协作动作以及第二协作动作,即网络模型可以从状态信息中获取到更有用的信息即第一协作消息,然后根据第一协作消息得到每个智能体最佳的协作动作,从而能更好的完成协作任务。
基于第一方面至第一方面的第二种实施方式,本申请还提供了第一方面的第三种实施方式:
状态信息包括第一智能体从环境中获取到的第一状态信息,还包括其他智能体从环境中获取到的第二状态信息,其中第二状态信息用于确定其他智能体的第二协作消息。
基于第一方面的第三种实施方式,本申请还提供了第一方面的第四种实施方式:
在多智能体协作的场景下,多个智能体从环境中采集状态信息也是由任务决定,根据一定的评价机制来分配采集任务,第一智能体根据协作任务的评价机制从环境中获取第一状态信息,第二智能体根据同一评价机制从环境中获得第二状态信息,然后每一个智能体别对获取到的状态信息进行处理,得到相互通信的协作消息,这样每一个智能体都可以观测到全局的状态信息,更好的协作完成任务。
基于第一方面至第一方面的第四种实施方式,本申请还提供了第一方面的第五种实施方式:
第一智能体还可以包括一个筛选模型,处于第一模型与第二模型之间,用于对第一协作消息和第二协作消息进行筛选,即在第二模型通过第一协作消息和第二协作消息得到第一协作动作之前,对输入第二模型的第一协作消息和第二协作消息进行处理,包括冗余信 息删除,错误信息修改等;这样可以使得第二模型的输入更加准确简洁,减少第二模型的强化训练的复杂度。
基于第一方面至第一方面的第五种实施方式,本申请还提供了第一方面的第六种实施方式:
在两个智能体之间相互通信发送协作消息时,还可以通过通信模块来进行协作消息的传递,首先,第一智能体通过通信模块对第一协作消息进行编码,然后向其他智能体发送编码后的第一协作消息,然后其他智能体通过该通信模型,再对其进行解码,然后获取第一协作消息。
一般的智能体在相互通信传递信息时,为了应对信道变换,需要增加冗余来对抗通信介质的干扰,如果利用神经网络对信息进行传递,并且根据任务对通信模型进行强化训练,这样可以得到对任务更有用的通信内容,提高了通信效率。
本申请的第二方面提供了一种第一智能体,包括:
处理单元,用于通过第一模型,对从环境中获取的第一状态信息进行处理,得到第一协作消息。
发送单元,用于向至少一个第二智能体发送所述第一协作消息;
所述处理单元,还用于通过第二模型,对所述第一协作消息和第二协作消息进行处理,得到所述第一智能体执行的第一协作动作,所述第二协作消息由所述至少一个第二智能体发送。
其中,所述第一模型和所述第二模型根据相同的奖励确定;所述第一协作消息还用于确定所述至少一个第二智能体需执行的第二协作动作;所述奖励与状态信息、所述第一协作消息、所述第二协作动作和所述第一协作动作相关;所述状态信息包括所述第一状态信息。
基于第二方面,本申请还提供了第二方面的第一种实施方式:
所述奖励是对基于同一协作任务的所述第一智能体与所述至少一个第二智能体的任务完成度的评价,所述奖励包括第一奖励和/或第二奖励,所述第一奖励为所述状态信息和所述第一协作消息的相关度;其中,所述状态信息和所述第一协作消息的相关度越低,所述第一奖励越高。
基于第二方面的第一种实施方式,本申请还提供了第二方面的第二种实施方式:
所述第二奖励为所述第一协作消息、所述第一协作动作和所述第二协作动作的相关度;其中,所述第一协作消息、所述第一协作动作和所述第二协作动作的相关度越高,所述第二奖励越高。
基于第二方面至第二方面的第二种实施方式,本申请还提供了第二方面的第三种实施方式:
所述状态信息还包括第二状态信息,所述第二状态信息用于所述至少一个第二智能体根据所述第二状态信息得到所述第二协作消息。
基于第二方面的第三种实施方式,本申请还提供了第二方面的第四种实施方式:
获取单元,用于根据协作任务的评价机制从环境中获取所述第一状态信息;其中,所述第二状态信息为所述至少一个第二智能体根据所述协作任务的同一评价机制从环境中获 得。
基于第二方面至第二方面的第四种实施方式,本申请还提供了第二方面的第五种实施方式:
接收单元,用于通过筛选模型接收所述第二协作消息,所述筛选模型用于根据所述第一协作消息和所述第二协作消息进行筛选。
基于第二方面至第二方面的第五种实施方式,本申请还提供了第二方面的第六种实施方式:
所述发送单元具体用于通过通信模型对所述第一协作消息进行编码;所述发送单元向所述至少一个第二智能体发送编码后的所述第一协作消息,以使得所述至少一个第二智能体通过所述通信模型对编码后的所述第一协作消息进行解码,获取所述第一协作消息。
本申请第三方面提供一种智能体,包括:至少一个处理器和存储器,存储器存储有可在处理器上运行的计算机执行指令,当所述计算机执行指令被所述处理器执行时,所述智能体执行如上述第一方面或第一方面任意一种可能的实现方式所述的方法。
本申请第四方面提供了一种多智能体协作系统,包括:第一智能体和至少一个第二智能体,所述第一智能体和所述至少一个第二智能体执行如上述第一方面至第一方面的任一种可能的实现方式中所述的方法。
本申请实施例第五方面提供了一种计算机存储介质,该计算机存储介质用于储存为上述智能体所用的计算机软件指令,其包括用于执行为智能体所设计的程序。
该智能体可以如前述第二方面所描述的第一智能体。
本申请第六方面提供了一种芯片或者芯片系统,该芯片或者芯片系统包括至少一个处理器和通信接口,通信接口和至少一个处理器通过线路互联,至少一个处理器用于运行计算机程序或指令,以进行第一方面至第一方面的任一种可能的实现方式中任一项所描述的智能体动作的决策方法;
其中,芯片中的通信接口可以为输入/输出接口、管脚或电路等。
在一种可能的实现中,本申请中上述描述的芯片或者芯片系统还包括至少一个存储器,该至少一个存储器中存储有指令。该存储器可以为芯片内部的存储单元,例如,寄存器、缓存等,也可以是该芯片的存储单元(例如,只读存储器、随机存取存储器等)。
本申请第七方面提供了一种计算机程序产品,该计算机程序产品包括计算机软件指令,该计算机软件指令可通过处理器进行加载来实现上述第一方面中任意一项的智能体动作的决策方法中的流程。
从以上技术方案可以看出,本申请实施例具有以下优点:
本申请通过对环境中采集的状态信息进行处理,得到第一协作消息,然后再对第一协作消息进行处理得到第一协作动作,使得原来根据状态信息同时得到协作消息和协作动作的并行网络模型变成串行网络模型,同时奖励除了与任务相关,也与状态信息、协作消息以及协作动作相关,这样使得每一个网络模型的结构更加简单,同时新的奖励机制可以直接对第一模型和第二模型进行评价,减少了神经网络模型的训练复杂度,提高了协作任务的完成度。
附图说明
图1为本申请实施例提供的智能体强化学习示意图;
图2为本申请实施例提供的多智能体强化学习示意图;
图3为本申请实施例提供的全连接神经网络的结构示意图;
图4为本申请实施例提供的智能体动作的决策方法对应的网络架构图;
图4A为本申请实施例提供的智能体动作的决策方法对应的另一个网络架构图;
图4B为本申请实施例提供的智能体动作的决策方法对应的另一个网络架构图;
图5为本申请实施例提供的智能体动作的决策方法的流程示意图;
图6为本申请实施例提供的第一模型和第二模型对应的训练框架图;
图7为本申请实施例提供的第一模型和第二模型对应的另一种训练框架图;
图8为本申请实施例提供的一种智能体的结构示意图;
图9为本申请实施例提供的另一种智能体的结构示意图。
具体实施方式
本申请提供了一种智能体动作的决策方法,用于解决利用任务奖励来间接确定通信内容而导致的神经网络模型训练困难的问题。
下面将结合本申请中的附图,对本申请中的技术方案进行详细描述,显然,所描述的实施例仅仅是本申请一部分实施例,而不是全部的实施例。
本申请的说明书和权利要求书及上述附图中的术语“第一”、“第二”、“第三”、“第四”等(如果存在)是用于区别类似的对象,而不必用于描述特定的顺序或先后次序。应该理解这样使用的数据在适当情况下可以互换,以便这里描述的实施例能够以除了在这里图示或描述的内容以外的顺序实施。此外,术语“包括”和“具有”以及他们的任何变形,意图在于覆盖不排他的包含,例如,包含了一系列步骤或单元的过程、方法、系统、产品或设备不必限于清楚地列出的那些步骤或单元,而是可包括没有清楚地列出的或对于这些过程、方法、产品或设备固有的其它步骤或单元。
请参阅图1,为本申请实施例提供的智能体强化学习示意图,强化学习是智能体利用与环境交互的方式来进行学习,智能体接收环境反馈的状态信息,做出动作,然后获得奖励以及下一时刻的状态。其中,奖励是根据任务本身对动作做出的评价,在一段时间内,智能体积累到的奖励越大,证明智能体对环境状态所做出来动作越好;智能体不断的进行学习,调整动作,最终根据奖励获得知识,改进行动方案以适应环境。
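As an illustration of the interaction loop described above (the agent receives a state from the environment, takes an action, and obtains a reward plus the next state), the following minimal Python sketch runs a few interaction steps; the `ToyEnv` environment and the random placeholder policy are hypothetical and serve only to show the loop, they are not part of this application.

```python
import random

class ToyEnv:
    """Hypothetical one-dimensional environment used only to illustrate the loop."""
    def __init__(self):
        self.state = 0

    def step(self, action):
        # Reward is higher the closer the state stays to zero.
        self.state += action
        reward = -abs(self.state)
        return self.state, reward

env = ToyEnv()
state, total_reward = env.state, 0.0
for t in range(10):
    action = random.choice([-1, 0, 1])   # placeholder policy; learning would improve this
    state, reward = env.step(action)     # environment feeds back the next state and a reward
    total_reward += reward               # accumulated reward evaluates the behaviour over time
print(total_reward)
```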
请参阅图2,为本申请实施例提供的多智能体强化学习示意图,多智能体强化学习是指多个智能体同时与环境进行交互,对环境状态做出动作,共同完成协作任务;例如多基站联合调度、自动驾驶中多汽车的联合编队、设备到设备中的多用户联合传输等。由于单个智能体很难观测到全局的状态信息,要实现多智能体之间的协作,更好完成任务,就需要多智能体联合学习,多智能体联合学习即需要通过通信来交互信息。
多智能体之间通信,需要解决传输内容和传输方式的问题,传统的通信方式为传输内容和传输方式分开设计。传输内容通常为某个智能体所观测到的所有状态,如自动驾驶汽车摄像头采集的图像或者视频、基站收集的本小区用户的信道状态,各种传感器采集的数据等。智能体需要将这些数据传输给其他的智能体,通信方法则是采用香农的通信构架,包括信源和信道编码等。其中,信源编码实现对信源的压缩,以降低通信开销;信道编码 通过增加冗余,以对抗通信介质中的干扰。
这种通信方式没有针对任务和接收方自身观测到的状态信息对通信内容进行筛选,最坏的情况为发送方需要将自己全部观测到的状态信息传输给其他所有智能体,接收方必须接收到其他所有智能体观测的全部状态信息,这样才能保证每个智能体观测到全局状态,做出最佳动作;因此通信内容将含有大量的冗余信息,导致通信效率降低。
因此,多智能体可以选择基于学习的通信方式,即利用强化学习中奖励的引导,每个智能体自主学习到完成任务所需要的通信内容,利用神经网络模型对通信内容进行选择。根据任务奖励的引导,对神经网络模型进行训练。
如图3所示,为全连接神经网络的结构示意图,全连接神经网络又叫多层感知机(multilayer perceptron,MLP),MLP的左侧$x_i$为输入层,右侧$y_i$为输出层,中间多层则为隐藏层,每一层都包括数个节点,被称为神经元;其中相邻两层的神经元之间两两相连。以第一层为例,第二层每个神经元$h$为第一层输出经加权求和并通过激活函数后的结果,依次推导,神经网络最后的输出用矩阵表达可以递归为:$y=f_n(w_n f_{n-1}(\cdots)+b_n)$。
其中w为权重矩阵,b为偏置向量,f为激活函数。因此,简单的说,可以将神经网络模型理解为一个从输入数据集合到输出数据集合的映射关系。而通常神经网络模型都是随机初始化的,神经网络模型的训练则是不断优化权重矩阵w和偏置向量b的过程,具体的方式是采用损失函数对神经网络的输出结果进行评价,并将误差反向传播,通过梯度下降的方法迭代优化w和b,以降低损失函数,然后利用训练优化后的w和b来对输入进行处理,得到输出;其中,损失函数与任务相关,是对任务的一种评价。
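A minimal NumPy sketch of the recursive forward pass $y=f_n(w_n f_{n-1}(\cdots)+b_n)$ described above; the layer sizes, random initialization, and ReLU activation are illustrative assumptions rather than choices made by this application.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def mlp_forward(x, weights, biases):
    """Apply y = f_n(w_n f_{n-1}(...) + b_n) layer by layer."""
    h = x
    for w, b in zip(weights, biases):
        h = relu(w @ h + b)   # weighted sum plus bias, then activation (applied to every layer for brevity)
    return h

rng = np.random.default_rng(0)
sizes = [4, 8, 8, 2]   # input layer, two hidden layers, output layer (illustrative)
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]
y = mlp_forward(rng.standard_normal(4), weights, biases)
print(y.shape)   # (2,)
```

Training, as described above, would iteratively adjust `weights` and `biases` by back-propagating a task-related loss; that part is omitted here.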
在多智能体联合协作的场景下,多智能体通过神经网络模型,输入从环境中获取的状态信息,输出自身的协作动作和传输给他人的协作消息,利用奖励来对输出进行评价,对神经网络模型进行训练,其中强化学习的目标为:
$\max_{\pi}\ \mathbb{E}_{\pi}\Big[\sum_{t}\gamma^{t}r_{t}\Big]$
其中,r t是奖励,γ是折扣因子,π是策略,包括关于协作动作的策略和关于协作消息的策略。其中,π(a i|s i,m -i)表示智能体在状态为s i,从其他智能体收到的消息为m -i时,执行动作a i;π(m i|s i)表示智能体在状态为时s i,生成协作消息m i传输给其他智能体;即奖励机制引导神经网络改变输出的协作动作和协作消息,不同的输出会获得不同的奖励,在一段时间内累积的奖励越大时,其对应的协作动作和协作消息越好。
由于奖励是对任务完成度进行的评价,而协作消息影响协作动作,协作动作直接影响任务完成,所以奖励对协作消息的评价是间接的,这将导致根据奖励学习协作消息内容是非常困难的,同时,根据状态信息并行输出协作消息和协作动作将导致神经网络模型规模显著增大,使得神经网络模型训练变得异常困难;另一方面强化学习的目标中只有奖励,没有考虑通信开销,这可能会导致学习到的通信内容维度较大,无法适用于实际通信场景。
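For reference, the discounted cumulative reward $\sum_{t}\gamma^{t}r_{t}$ that this objective maximizes can be computed from a recorded reward sequence as in the short sketch below; the example rewards and discount factor are arbitrary.

```python
def discounted_return(rewards, gamma=0.99):
    """Sum_t gamma^t * r_t accumulated over one episode."""
    g, discount = 0.0, 1.0
    for r in rewards:
        g += discount * r
        discount *= gamma
    return g

print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1*1 + 0*0.9 + 2*0.81 = 2.62
```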
请参阅图4,为本申请实施例中的智能体动作的决策方法对应的网络架构图,如图4所示,第一智能体从环境中获取第一状态信息s1,将第一状态信息s1输入至第一智能体的第一模型,生成第一协作消息m1;第二智能体从环境中获取第二状态信息s2,将第二状态信息输入至第二智能体的第一模型,生成第二协作消息m2;然后,第一智能体接收第二 智能体发送的第二协作消息m2,并且将第一协作消息m1和第二协作消息m2作为第一智能体对应的第二模型的输入,第二模型对其进行处理,生成第一协作动作a1;同理,第二智能体接收第一智能体发送的第一协作消息m1,并且将第一协作消息m1和第二协作消息m2作为第二智能体对应的第二模型的输入,第二模型对其进行处理,生成第二智能体的第二协作动作a2。
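A hedged PyTorch sketch of this serial structure, assuming two agents, small fully connected networks, and illustrative dimensions: the first model maps the locally observed state to a collaboration message, and the second model maps the agent's own message together with the received message to an action distribution. The class names, layer sizes, and tensor shapes are hypothetical.

```python
import torch
import torch.nn as nn

class FirstModel(nn.Module):
    """Maps a locally observed state s_i to a compact collaboration message m_i."""
    def __init__(self, state_dim, msg_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, msg_dim))

    def forward(self, state):
        return self.net(state)

class SecondModel(nn.Module):
    """Maps the agent's own message plus received messages to an action distribution."""
    def __init__(self, msg_dim, n_agents, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(msg_dim * n_agents, 64), nn.ReLU(),
                                 nn.Linear(64, n_actions))

    def forward(self, own_msg, other_msgs):
        x = torch.cat([own_msg] + list(other_msgs), dim=-1)
        return torch.softmax(self.net(x), dim=-1)

# Serial decision for one agent (dimensions are illustrative).
first = FirstModel(state_dim=10, msg_dim=4)
second = SecondModel(msg_dim=4, n_agents=2, n_actions=5)
s1 = torch.randn(1, 10)       # first state information observed from the environment
m1 = first(s1)                # step 501: first collaboration message
m2 = torch.randn(1, 4)        # step 503: message received from a second agent
a1_probs = second(m1, [m2])   # step 504: distribution over the first collaboration action
print(a1_probs.shape)         # torch.Size([1, 5])
```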
本实施例采用串型网络结构,第一模型完成状态信息的提取,根据协作任务从中抽取对多个智能体的协作动作来说有用的信息,生成协作消息;该协作消息既用于本智能体的动作决策,也是传输给其他智能体的通信内容,影响其他智能体的协作动作决策。第二模型输入本智能体第一模型生成的协作消息和其他智能体发送的协作消息,输出本智能体所需要执行的协作动作。这样,每个神经网络模型的结构更加简单,可以通过改变奖励机制直接对第一模型和第二模型进行训练,具体方法步骤如图5所示:
501、第一智能体通过第一模型,对从环境中获取的第一状态信息进行处理,得到第一协作消息。
状态信息是根据协作任务观测到的环境特征,例如,自动驾驶中多汽车的联合编队的协作任务中,状态信息可以是汽车观测到的路面状况,汽车观测到的障碍物图像等;设备到设备中的多用户联合传输的协作任务中,状态信息可以是设备获取到的信道资源、信道损耗值等。智能体确定行动方案完成协作任务以适应环境,因此,首先需要从环境中获取到状态信息。
在多智能体联合协作的场景下,单个智能体很难掌握全局状态信息,需要多个智能体联合获取状态信息并进行交互,因此,每个智能体都有观测环境状态的任务;可选的,多个智能体根据协作任务本身来分配状态信息的获取,即第一智能体需要根据协作任务的某一评价机制从环境中获取所述第一状态信息;第二智能体也需要根据同一评价机制从环境中获得第二状态信息。
示例性的,在多汽车编队协作的场景下,当为高速公路货车编队时,由于高速公路一般路线固定,障碍物和突发因素较少,因此高速公路货车编队的任务则为与带头货车的行动路线保持一致,各货车之间车距保持不变。因此,根据该协作任务,其评价机制则可以包括对各车之间车距的评价,因此带头的汽车从环境中获取的状态可以是路况信息等,后面的汽车则不仅需要监测路况,还需要监测前一个汽车的位置。
示例性的,当多汽车编队协作为自动协同驾驶时,交通环境的复杂多变将需要汽车感知、计算庞大的数据量,需要车辆全面、准确并迅速地评估实时的交通环境变化,然后作出合理的动作规划;因此,根据这一评价机制,各汽车需要从环境中获取交通拥堵信息、感知领近车辆和行人的动态、获取交通信号灯信息、感知车辆间距,并且根据任务确定每个车辆需要监控的范围等,设计合理的状态信息获取策略,达到合理利用资源、合理完成任务的效果。
第一智能体根据评价机制对环境进行监测,获取到有用第一状态信息,然后对第一状态信息进行处理,即通过第一模型学习到第一状态信息中对任务有用的信息,得到第一协作信息;示例性的,第一模型的结构和维度由第一状态信息的数据结构进行确定,例如状态信息为图像,第一模型就可以采用卷积网络,若状态信息是某一信道值,第一模型则采 用全连接神经网络,具体形式不做限定。该第一协作信息不仅用于第一智能体的动作决策,而且还为多智能体之间相互交互通信的通信信息,便于其他智能体获取到全局环境状态信息,即也用于第二智能体的动作决策。
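As a sketch of choosing the first model's structure according to the data structure of the state information (a convolutional network for image-like states, a fully connected network for low-dimensional states such as a channel value), the helper below shows one possible construction; the layer sizes and kernel settings are assumptions.

```python
import torch.nn as nn

def build_first_model(state_is_image, msg_dim=4):
    if state_is_image:
        # Image-like state (e.g. road images): convolutional feature extractor.
        return nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv2d(8, 16, kernel_size=3, stride=2), nn.ReLU(),
            nn.Flatten(), nn.LazyLinear(msg_dim))
    # Low-dimensional state (e.g. a single channel value): fully connected network.
    return nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, msg_dim))

model = build_first_model(state_is_image=False)
```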
502、第一智能体向至少一个第二智能体发送第一协作消息。
第二智能体的第二协作动作需要由第一协作消息来确定,因此第一智能体需要向第二智能体发送该第一协作消息,在一个优选的实施方式中,第一智能体和第二智能体之间可以通过通信模型来传输信息。请参阅图4A,为本申请实施例中的智能体动作的决策方法对应的另一个网络架构图,如图4A所示:第一智能体在向第二智能体传输第一协作消息m1时,先将第一协作消息m1传输至第一智能体的通信模型,然后由第一智能体的通信模型经过信道传输至第二智能体的通信模型。
通信模型作为信道适配器,可以包括编码器和解码器,用来解决信道干扰、噪声等问题。例如,通信模型中的编码器先对第一协作消息m1进行编码,然后向第二智能体发送,第二智能体的通信模块接收到编码后第一协作消息m1,由通信模块中的解码器对其进行解码,完成数据传输。
其中,通信模块用于减少通信开销,并且保证数据传输真实可靠,利用神经网络可以不断学习新知识,不断调整编码方案,其可以与第一模型和第二模型联合训练,也可以作为独立的网络,完成通信任务,单独进行训练。
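A minimal sketch of such a communication model acting as a channel adapter, assuming linear encoder/decoder layers and a simple additive-noise channel; the code dimension and noise level are illustrative, and in practice the model could be trained jointly with the first and second models or on its own, as described above.

```python
import torch
import torch.nn as nn

class CommModel(nn.Module):
    """Channel adapter: encode a message before transmission, decode it after the channel."""
    def __init__(self, msg_dim=4, code_dim=8):
        super().__init__()
        self.encoder = nn.Linear(msg_dim, code_dim)
        self.decoder = nn.Linear(code_dim, msg_dim)

    def forward(self, msg, noise_std=0.1):
        coded = self.encoder(msg)                                 # sender side
        received = coded + noise_std * torch.randn_like(coded)    # crude additive-noise channel
        return self.decoder(received)                             # receiver side

comm = CommModel()
m1 = torch.randn(1, 4)
m1_hat = comm(m1)   # what the second agent recovers after decoding
print((m1 - m1_hat).abs().mean())
```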
503、第一智能体接收至少一个第二智能体发送的第二协作消息。
请参阅图4B,为本申请实施例中的智能体动作的决策方法对应的另一个网络架构图,如图4B所示,第一智能体接收到至少一个第二智能体发送第二协作消息之后,就需要通过第二模型对第一协作消息和第二协作消息进行处理,在将第一协作消息和第二协作消息传输至第二模型之前,还可以通过筛选模块对第一协作消息和第二协作消息进行筛选,将筛选后的信息输入至第二模型。
示例性的,筛选模型可以对第一协作消息和第二协作消息重复的信息进行删减,可以对比第一协作消息和第二协作消息,对错误信息进行纠正,具体不做限定。
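One possible, non-learned form of the filtering step is sketched below: received messages that are nearly identical to the agent's own message are dropped before the second model is invoked. The cosine-similarity test and its threshold are assumptions; a learned filtering model could replace them.

```python
import torch
import torch.nn.functional as F

def filter_messages(own_msg, received_msgs, dedup_threshold=0.95):
    """Drop received messages that add almost no information beyond the agent's own message."""
    kept = []
    for m in received_msgs:
        cos = F.cosine_similarity(own_msg, m, dim=-1)
        if cos.item() < dedup_threshold:   # keep only messages that look sufficiently different
            kept.append(m)
    return kept

own = torch.randn(1, 4)
print(len(filter_messages(own, [own.clone(), torch.randn(1, 4)])))  # duplicate is discarded
```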
可以理解的,步骤502和步骤503无时序上的先后关系,第一智能体可以先发送第一协作消息,再接收第二协作消息,也可以先接收第二协作消息,再发送第一协作消息,具体形式不做限定。
504、第一智能体通过第二模型,对所述第一协作消息和第二协作消息进行处理,得到所述第一智能体执行的第一协作动作。
第二模型用于动作决策,通过输入第一协作消息和第二协作消息,来确定智能体本身所要完成的协作动作,然后根据任务评价机制来获得奖励,并根据奖励来学习新知识,不断调整协作动作;最终累计最大奖励确定智能体的行动方案。
其中,第一模型和第二模型由相同的奖励确定;即第一模型和第二模型强化学习的目标一致,都可以为:
$\max_{\pi}\ \mathbb{E}_{\pi}\Big[\sum_{t}\gamma^{t}r_{t}\Big] + I(M;A) - I(S;M)$
其中,$\mathbb{E}_{\pi}\big[\sum_{t}\gamma^{t}r_{t}\big]$是任务的奖励;I(M;A)是协作消息与协作动作之间的互信息,最大化这一项即是从状态信息中提取到与协作动作最相关的协作信息;I(S;M)是协作消息与状态信息之间的互信息,最小化这一项实现对获取状态信息的压缩,去除对与协作动作无关的状态信息的获取。
可以理解的,第一模型和第二模型强化学习的目标中,可以只包括任务的奖励$\mathbb{E}_{\pi}\big[\sum_{t}\gamma^{t}r_{t}\big]$和协作消息与协作动作之间的互信息I(M;A),也可以只包括任务的奖励$\mathbb{E}_{\pi}\big[\sum_{t}\gamma^{t}r_{t}\big]$和协作消息与状态信息之间的互信息I(S;M),还可以包括三者,具体不做限定。
可以理解的,奖励与状态信息、第一协作消息、第二协作动作和第一协作动作都相关;是对状态信息、协作消息和协作动作的整体评价,即当智能体确定一个协作动作时,系统需要根据该智能体获取的状态信息与第一协作消息的相关度,第一协作消息与第一协作动作和第二协作动作的相关度来进行评价,确定最终的奖励。在训练过程中,第一模型和第二模型不断进行学习,确定多个行动方案,行动方案中包括状态信息的获取、第一协作消息的生成、以及第一协作动作和第二协作动作的确定。每个行动方案对应一个奖励,最大奖励对应的行动方案将会确定出最有效的状态信息的观测、确定出最有用的协作消息、以及最合适的协作动作。
可以理解的,奖励是对基于同一协作任务的第一智能体与至少一个第二智能体的任务完成度的评价,状态信息和第一协作消息的相关度越低,奖励越高;第一协作消息、第一协作动作和第二协作动作的相关度越高,奖励越高。
可以理解的,第一模型的作用为从状态信息中提取对任务最有用的信息生成第一协作信息,即第一模型需要完成对状态信息的筛选及压缩,当第一模型提取到最有效的信息时,其与状态信息的相关度也就越低,例如状态信息中包括十个不同方面的信息数据,最终第一模型筛选出与任务最相关的数据信息只有三个,这样就可以根据这三个数据信息得到第一协作消息,该第一协作消息与状态信息的相关度低,其奖励就高;第二模型的作用为根据协作消息得到最有用的动作,即第一协作消息与第一协作动作和第二协作动作的相关度越高,奖励越高。
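The relatedness terms above are not tied to a particular estimator here. Purely as an illustration of how a combined reward could be assembled, the sketch below uses mean absolute correlation over a batch as a crude stand-in for relatedness, with assumed weighting coefficients `alpha` and `beta`; it is not the estimator used by this application.

```python
import torch

def abs_corr(x, y):
    """Mean absolute Pearson correlation between batched vectors (a crude relatedness proxy)."""
    x = (x - x.mean(0)) / (x.std(0) + 1e-8)
    y = (y - y.mean(0)) / (y.std(0) + 1e-8)
    return (x.unsqueeze(-1) * y.unsqueeze(-2)).mean(0).abs().mean()

def total_reward(task_reward, states, msgs, actions, alpha=1.0, beta=1.0):
    first_reward = -abs_corr(states, msgs)    # lower state/message relatedness -> higher reward
    second_reward = abs_corr(msgs, actions)   # higher message/action relatedness -> higher reward
    return task_reward + alpha * first_reward + beta * second_reward

# Example with random batched tensors (states, messages, action scores are illustrative).
states, msgs, actions = torch.randn(32, 10), torch.randn(32, 4), torch.randn(32, 5)
print(total_reward(task_reward=1.0, states=states, msgs=msgs, actions=actions))
```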
本实施例通过对环境中采集的状态信息进行处理,得到第一协作消息,然后再对第一协作消息进行处理得到第一协作动作,使得原来根据状态信息同时得到协作消息和协作动作的并行网络模型变成串行网络模型,同时奖励除了与任务相关,也与状态信息、协作消息以及协作动作相关,这样使得每一个网络模型的结构更加简单,同时新的奖励机制可以 直接对第一模型和第二模型进行评价,减少了神经网络模型的训练复杂度,提高了协作任务的完成度。
下面对第一模型和第二模型的训练过程进行描述:
请参阅图6,为本申请实施例中的第一模型和第二模型对应的训练框架图,如图6所示:
训练框架包括第一智能体、第二智能体、共用的通信评价网络和动作评价网络、以及每个智能体的第一模型和第二模型;训练的过程为:
首先根据任务确定迭代轮数并且初始化所有网络的网络参数,然后第一智能体从环境中获取第一状态信息s1,第二智能体从环境中获取第二状态信息s2,然后基于共同的通信评价网络将第一状态信息s1输入至第一智能体的第一模型,将第二状态信息s2输入至第二智能体的第一模型,分别得到第一协作消息m1和第二协作消息m2;然后第一协作消息m1和第二协作消息m2传输至共同的动作评价网络中,由共同的动作评价网络传递信息至第一智能体的第二模型和第二智能体的第二模型,每个智能体的第二模型分别对m1和m2进行处理,分别得到第一协作动作a1和第二协作动作a2,然后根据a1和a2获得奖励,同时获取下一时刻的状态信息。当循环次数达到迭代轮数后,就累计计算总奖励并且保存训练模型,然后进行下一次训练,最后从多个训练模型中选取最佳模型,确定最佳参数。
请参阅图7,为本申请实施例中的第一模型和第二模型对应的另一种训练框架图,如图7所示:
训练框架包括第一智能体、第二智能体、每个智能体的第一模型和第二模型;训练的过程为:
首先根据任务确定迭代轮数并且初始化所有网络的网络参数,然后第一智能体从环境中获取第一状态信息s1,第二智能体从环境中获取第二状态信息s2,然后将第一状态信息s1输入至第一智能体的第一模型,将第二状态信息s2输入至第二智能体的第一模型,分别得到第一协作消息m1和第二协作消息m2;然后第一智能体将第一协作消息m1传输至第一智能体的第二模型和第二智能体,第二智能体将第二协作消息m2传输至第二智能体的第二模型和第一智能体;第一智能体的第二模型分别对m1和m2进行处理,得到第一协作动作a1,第二智能体的第二模型分别对m1和m2进行处理,得到第二协作动作a2,然后根据a1和a2获得奖励,同时获取下一时刻的状态信息。当循环次数达到迭代轮数后,就累计计算总奖励并且保存训练模型,然后进行下一次训练,最后从多个训练模型中选取最佳模型,确定最佳参数。
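A hedged end-to-end sketch of one such training loop, in the spirit of FIG. 7: each agent's first model produces a message from its own state, messages are exchanged, each agent's second model produces an action from both messages, and a shared reward drives a REINFORCE-style update of all networks. The toy cooperative reward, the detaching of the other agent's message, and all dimensions are simplifying assumptions, not choices mandated by this application.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
state_dim, msg_dim, n_actions = 6, 3, 4
# One first model (state -> message) and one second model (both messages -> action) per agent.
first = [nn.Sequential(nn.Linear(state_dim, 32), nn.ReLU(), nn.Linear(32, msg_dim)) for _ in range(2)]
second = [nn.Sequential(nn.Linear(2 * msg_dim, 32), nn.ReLU(), nn.Linear(32, n_actions)) for _ in range(2)]
params = [p for m in first + second for p in m.parameters()]
opt = torch.optim.Adam(params, lr=1e-3)

def toy_task_reward(actions):
    # Hypothetical cooperative reward: both agents should pick the same action index.
    return 1.0 if actions[0] == actions[1] else 0.0

for step in range(200):                                        # iteration rounds
    s = [torch.randn(1, state_dim) for _ in range(2)]          # s1, s2 observed from the environment
    m = [first[i](s[i]) for i in range(2)]                     # m1, m2
    # Each agent's second model sees its own message and the other agent's message
    # (detached here as a simplification of the message exchange).
    logits = [second[i](torch.cat([m[i], m[1 - i].detach()], dim=-1)) for i in range(2)]
    dists = [torch.distributions.Categorical(logits=l) for l in logits]
    a = [d.sample() for d in dists]                            # a1, a2
    r = toy_task_reward(a)                                     # shared reward for the joint actions
    # REINFORCE-style update: the same reward trains both the message and the action networks.
    loss = -r * sum(d.log_prob(ai) for d, ai in zip(dists, a)).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()
```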
请参阅图8,本申请实施例提供的一种第一智能体的结构示意图。如图8所示,该第一智能体800包括:
处理单元801,用于通过第一模型,对从环境中获取的第一状态信息进行处理,得到第一协作消息。
发送单元802,用于向至少一个第二智能体发送所述第一协作消息。
所述处理单元801,还用于通过第二模型,对所述第一协作消息和第二协作消息进行处理,得到所述第一智能体执行的第一协作动作,所述第二协作消息由所述至少一个第二智能体发送。
其中,所述第一模型和所述第二模型根据相同的奖励确定;所述第一协作消息还用于确定所述至少一个第二智能体需执行的第二协作动作;所述奖励与状态信息、所述第一协作消息、所述第二协作动作和所述第一协作动作相关;所述状态信息包括所述第一状态信息。
一种可能的实现中,所述奖励是对基于同一协作任务的所述第一智能体800与所述至少一个第二智能体的任务完成度的评价,所述奖励包括第一奖励和/或第二奖励,所述第一奖励为所述状态信息和所述第一协作消息的相关度;其中,所述状态信息和所述第一协作消息的相关度越低,所述第一奖励越高。
一种可能的实现中,所述第二奖励为所述第一协作消息、所述第一协作动作和所述第二协作动作的相关度;其中,所述第一协作消息、所述第一协作动作和所述第二协作动作的相关度越高,所述第二奖励越高。
一种可能的实现中,所述状态信息还包括第二状态信息,所述第二状态信息用于所述至少一个第二智能体根据所述第二状态信息得到所述第二协作消息。
一种可能的实现中,所述第一智能体800还包括获取单元803。
所述获取单元803,用于根据协作任务的评价机制从环境中获取所述第一状态信息;其中,所述第二状态信息为所述至少一个第二智能体根据所述协作任务的同一评价机制从环境中获得。
一种可能的实现中,所述第一智能体还包括接收单元804。
所述接收单元804,用于通过筛选模型接收所述第二协作消息,所述筛选模型用于根据所述第一协作消息对所述第二协作消息进行筛选。
一种可能的实现中,所述发送单元802具体用于通过通信模型对所述第一协作消息进行编码;所述发送单元802向所述至少一个第二智能体发送编码后的所述第一协作消息,以使得所述至少一个第二智能体通过所述通信模型对编码后的所述第一协作消息进行解码,获取所述第一协作消息。
上述第一智能体800的各个单元的功能,具体可参见前述图5所示的方法实施例中的第一智能体的实现细节,此处不再赘述。
请参阅图9,为本申请实施例提供的另一种第一智能体的结构示意图,该第一智能体900包括:中央处理器901,存储器902,通信接口903。
中央处理器901、存储器902、通信接口903通过总线相互连接;总线可以是外设部件互连标准(peripheral component interconnect,简称PCI)总线或扩展工业标准结构(extended industry standard architecture,简称EISA)总线等。所述总线可以分为地址总线、数据总线、控制总线等。为便于表示,图9中仅用一条粗线表示,但并不表示仅有一根总线或一种类型的总线。
存储器902可以包括易失性存储器(volatile memory),例如随机存取存储器(random-access memory,RAM);存储器也可以包括非易失性存储器(non-volatile memory),例如快闪存储器(flash memory),硬盘(hard disk drive,HDD)或固态硬盘(solid-state drive,SSD);存储器902还可以包括上述种类的存储器的组合。
中央处理器901可以是中央处理器(central processing unit,CPU),网络处理器(英 文:network processor,NP)或者CPU和NP的组合。中央处理器901还可以进一步包括硬件芯片。上述硬件芯片可以是专用集成电路(application-specific integrated circuit,ASIC),可编程逻辑器件(programmable logic device,PLD)或其组合。上述PLD可以是复杂可编程逻辑器件(complex programmable logic device,CPLD),现场可编程逻辑门阵列(field-programmable gate array,FPGA),通用阵列逻辑(generic array logic,GAL)或其任意组合。
通信接口903可以为有线通信接口,无线通信接口或其组合,其中,有线通信接口例如可以为以太网接口。以太网接口可以是光接口,电接口或其组合。无线通信接口可以为WLAN接口,蜂窝网络通信接口或其组合等。
可选地,存储器902还可以用于存储程序指令,中央处理器901调用该存储器902中存储的程序指令,可以执行图5所示方法实施例中的一个或多个步骤,或其中可选的实施方式,使得所述第一智能体900实现上述方法中智能体的功能,具体此处不再赘述。
本申请实施例还提供了一种多智能体协作系统,包括:第一智能体和至少一个第二智能体;第一智能体和至少一个第二智能体执行如上述图5所示的任一项智能体动作的决策方法。
本申请实施例还提供了一种芯片或者芯片系统,该芯片或者芯片系统包括至少一个处理器和通信接口,通信接口和至少一个处理器通过线路互联,至少一个处理器运行指令或计算机程序,执行图5所示方法实施例中的一个或多个步骤,或其中可选的实施方式,以实现上述方法中智能体的功能。
其中,芯片中的通信接口可以为输入/输出接口、管脚或电路等。
在一种可能的实现中,上述描述的芯片或者芯片系统还包括至少一个存储器,该至少一个存储器中存储有指令。该存储器可以为芯片内部的存储单元,例如,寄存器、缓存等,也可以是该芯片的存储单元(例如,只读存储器、随机存取存储器等)。
本申请实施例还提供了一种计算机存储介质,该计算机存储介质中存储有实现本申请实施例提供的智能体动作的决策方法中智能体功能的计算机程序指令。
本申请实施例还提供了一种计算机程序产品,该计算机程序产品包括计算机软件指令,该计算机软件指令可通过处理器进行加载来实现上述图5所示智能体动作的决策方法中的流程。
在上述实施例中,可以全部或部分地通过软件、硬件、固件或者其任意组合来实现。当使用软件实现时,可以全部或部分地以计算机程序产品的形式实现。
所述计算机程序产品包括一个或多个计算机指令。在计算机上加载和执行所述计算机程序指令时,全部或部分地产生按照本申请所述的流程或功能。所述计算机可以是通用计算机、专用计算机、计算机网络、或者其他可编程装置。所述计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一计算机可读存储介质传输,例如,所述计算机指令可以从一个网站站点、计算机、服务器或数据中心通过有线(例如同轴电缆、光纤、数字用户线(Digital Subscriber Line,DSL))或无线(例如红外、无线、微波等)方式向另一个网站站点、计算机、服务器或数据中心进行传输。所述计算机可读存储介质可以是计算机能够存储的任何可用介质或者是包含一个或多个可用介质集成的服 务器、数据中心等数据存储设备。所述可用介质可以是磁性介质,(例如,软盘、硬盘、磁带)、光介质(例如,DVD)、或者半导体介质(例如固态硬盘Solid State Disk(SSD))等。
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的系统,装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
在本申请所提供的几个实施例中,应该理解到,所揭露的系统,装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。
所述集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(ROM,read-only memory)、随机存取存储器(RAM,random access memory)、磁碟或者光盘等各种可以存储程序代码的介质。

Claims (20)

  1. 一种智能体动作的决策方法,其特征在于,所述方法包括:
    第一智能体通过第一模型,对从环境中获取的第一状态信息进行处理,得到第一协作消息并向至少一个第二智能体发送所述第一协作消息;
    所述第一智能体通过第二模型,对所述第一协作消息和第二协作消息进行处理,得到所述第一智能体执行的第一协作动作,所述第二协作消息由所述至少一个第二智能体发送;
    其中,所述第一模型和所述第二模型根据相同的奖励确定;所述第一协作消息还用于确定所述至少一个第二智能体需执行的第二协作动作;所述奖励与状态信息、所述第一协作消息、所述第二协作动作和所述第一协作动作相关;所述状态信息包括所述第一状态信息。
  2. 根据权利要求1所述的方法,其特征在于,所述奖励是对基于同一协作任务的所述第一智能体与所述至少一个第二智能体的任务完成度的评价,所述奖励包括第一奖励和/或第二奖励,所述第一奖励为所述状态信息和所述第一协作消息的相关度;其中,所述状态信息和所述第一协作消息的相关度越低,所述第一奖励越高。
  3. 根据权利要求2所述的方法,其特征在于,所述第二奖励为所述第一协作消息、所述第一协作动作和所述第二协作动作的相关度;其中,所述第一协作消息、所述第一协作动作和所述第二协作动作的相关度越高,所述第二奖励越高。
  4. 根据权利要求1至3任一项所述的方法,其特征在于,所述状态信息还包括第二状态信息,所述第二状态信息用于所述至少一个第二智能体根据所述第二状态信息得到所述第二协作消息。
  5. 根据权利要求4所述的方法,其特征在于,所述方法还包括:
    所述第一智能体根据协作任务的评价机制从环境中获取所述第一状态信息;其中,所述第二状态信息为所述至少一个第二智能体根据所述协作任务的同一评价机制从环境中获得。
  6. 根据权利要求1至5任一项所述的方法,其特征在于,在所述第一智能体通过第二模型,对所述第一协作消息和第二协作消息进行处理之前,所述方法还包括:
    所述第一智能体通过筛选模型接收所述第二协作消息,所述筛选模型用于根据所述第一协作消息对所述第二协作消息进行筛选。
  7. 根据权利要求1至6任一项所述的方法,其特征在于,所述第一智能体向至少一个第二智能体发送所述第一协作消息,包括:
    所述第一智能体通过通信模型对所述第一协作消息进行编码;
    所述第一智能体向所述至少一个第二智能体发送编码后的所述第一协作消息,以使得所述至少一个第二智能体通过所述通信模型对编码后的所述第一协作消息进行解码,获取所述第一协作消息。
  8. 一种第一智能体,其特征在于,所述第一智能体包括:
    处理单元,用于通过第一模型,对从环境中获取的第一状态信息进行处理,得到第一协作消息;
    发送单元,用于向至少一个第二智能体发送所述第一协作消息;
    所述处理单元,还用于通过第二模型,对所述第一协作消息和第二协作消息进行处理,得到所述第一智能体执行的第一协作动作,所述第二协作消息由所述至少一个第二智能体发送;
    其中,所述第一模型和所述第二模型根据相同的奖励确定;所述第一协作消息还用于确定所述至少一个第二智能体需执行的第二协作动作;所述奖励与状态信息、所述第一协作消息、所述第二协作动作和所述第一协作动作相关;所述状态信息包括所述第一状态信息。
  9. 根据权利要求8所述的第一智能体,其特征在于,所述奖励是对基于同一协作任务的所述第一智能体与所述至少一个第二智能体的任务完成度的评价,所述奖励包括第一奖励和/或第二奖励,所述第一奖励为所述状态信息和所述第一协作消息的相关度;其中,所述状态信息和所述第一协作消息的相关度越低,所述第一奖励越高。
  10. 根据权利要求9所述的第一智能体,其特征在于,所述第二奖励为所述第一协作消息、所述第一协作动作和所述第二协作动作的相关度;其中,所述第一协作消息、所述第一协作动作和所述第二协作动作的相关度越高,所述第二奖励越高。
  11. 根据权利要求8至10任一项所述的第一智能体,其特征在于,所述状态信息还包括第二状态信息,所述第二状态信息用于所述至少一个第二智能体根据所述第二状态信息得到所述第二协作消息。
  12. 根据权利要求11所述的第一智能体,其特征在于,所述第一智能体还包括:
    获取单元,用于根据协作任务的评价机制从环境中获取所述第一状态信息;其中,所述第二状态信息为所述至少一个第二智能体根据所述协作任务的同一评价机制从环境中获得。
  13. 根据权利要求8至12任一项所述的第一智能体,其特征在于,所述第一智能体还包括:
    接收单元,用于通过筛选模型接收所述第二协作消息,所述筛选模型用于根据所述第一协作消息对所述第二协作消息进行筛选。
  14. 根据权利要求8至13任一项所述的第一智能体,其特征在于,
    所述发送单元具体用于通过通信模型对所述第一协作消息进行编码;
    所述发送单元向所述至少一个第二智能体发送编码后的所述第一协作消息,以使得所述至少一个第二智能体通过所述通信模型对编码后的所述第一协作消息进行解码,获取所述第一协作消息。
  15. 一种智能体,包括:至少一个处理器、存储器,存储器存储有可在处理器上运行的计算机执行指令,当所述计算机执行指令被所述处理器执行时,使得所述智能体执行如上述权利要求1至权利要求7任一项所述的方法。
  16. 一种多智能体协作系统,其特征在于,包括:第一智能体和至少一个第二智能体,所述第一智能体和所述至少一个第二智能体执行如上述权利要求1至7任一项所述的方法。
  17. 一种存储一个或多个计算机执行指令的计算机可读存储介质,其特征在于,当所述计算机执行指令被处理器执行时,所述处理器执行如上述权利要求1至7任一项所述的方法。
  18. 一种计算机程序产品,其特征在于,所述计算机程序产品包括计算机程序指令,当所述计算机程序指令被处理器执行时,所述处理器执行如权利要求1至7任一项所述的方法。
  19. 一种程序,其特征在于,当所述程序在计算机上运行时,使得所述计算机执行如权利要求1-7任一项所述的方法。
  20. 一种芯片,其特征在于,包括至少一个处理器和通信接口,通信接口和至少一个处理器通过线路互联,至少一个处理器用于运行计算机程序或指令,以执行如权利要求1至7任一项所述的方法。
PCT/CN2021/087642 2020-04-17 2021-04-16 智能体动作的决策方法及相关设备 WO2021209023A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP21788348.7A EP4120042A4 (en) 2020-04-17 2021-04-16 DECISION-MAKING PROCESS FOR AGENT ACTION, AND ASSOCIATED MECHANISM
US17/964,233 US20230032176A1 (en) 2020-04-17 2022-10-12 Decision-making method for agent action and related device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010306503.7A CN113534784B (zh) 2020-04-17 2020-04-17 智能体动作的决策方法及相关设备
CN202010306503.7 2020-04-17

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/964,233 Continuation US20230032176A1 (en) 2020-04-17 2022-10-12 Decision-making method for agent action and related device

Publications (1)

Publication Number Publication Date
WO2021209023A1 true WO2021209023A1 (zh) 2021-10-21

Family

ID=78084156

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/087642 WO2021209023A1 (zh) 2020-04-17 2021-04-16 智能体动作的决策方法及相关设备

Country Status (4)

Country Link
US (1) US20230032176A1 (zh)
EP (1) EP4120042A4 (zh)
CN (1) CN113534784B (zh)
WO (1) WO2021209023A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023093734A1 (zh) * 2021-11-24 2023-06-01 华为技术有限公司 数据处理方法和通信装置

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8924069B1 (en) * 2008-04-09 2014-12-30 The United States of America as represented by the Administrator of the National Aeronautics & Space Administration (NASA) Artificial immune system approach for airborne vehicle maneuvering
CN105892480A (zh) * 2016-03-21 2016-08-24 南京航空航天大学 异构多无人机系统协同察打任务自组织方法
CN108334112A (zh) * 2018-04-08 2018-07-27 李良杰 无人机协作系统
CN109116854A (zh) * 2018-09-16 2019-01-01 南京大学 一种基于强化学习的多组机器人协作控制方法及控制系统
CN110442133A (zh) * 2019-07-29 2019-11-12 南京市晨枭软件技术有限公司 一种多组工业机器人协同作业的方法及系统
CN210198395U (zh) * 2019-03-18 2020-03-27 东莞理工学院 无人机与无人车协作导航系统

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104750745B (zh) * 2013-12-30 2019-08-16 华为技术有限公司 一种智能体处理信息的方法及智能体
GB2554135B (en) * 2016-07-08 2020-05-20 Jaguar Land Rover Ltd Vehicle communication system and method
CN106964145B (zh) * 2017-03-28 2020-11-10 南京邮电大学 一种仿人足球机器人传球控制方法及球队控球方法
US11630998B2 (en) * 2018-03-26 2023-04-18 Cohda Wireless Pty Ltd. Systems and methods for automatically training neural networks
CN108600379A (zh) * 2018-04-28 2018-09-28 中国科学院软件研究所 一种基于深度确定性策略梯度的异构多智能体协同决策方法
CN109617968B (zh) * 2018-12-14 2019-10-29 启元世界(北京)信息技术服务有限公司 多智能体协作系统及其智能体、智能体间的通信方法
CN110471297B (zh) * 2019-07-30 2020-08-11 清华大学 多智能体协同控制方法、系统及设备
CN110991972B (zh) * 2019-12-14 2022-06-21 中国科学院深圳先进技术研究院 一种基于多智能体强化学习的货物运输系统

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8924069B1 (en) * 2008-04-09 2014-12-30 The United States of America as represented by the Administrator of the National Aeronautics & Space Administration (NASA) Artificial immune system approach for airborne vehicle maneuvering
CN105892480A (zh) * 2016-03-21 2016-08-24 南京航空航天大学 异构多无人机系统协同察打任务自组织方法
CN108334112A (zh) * 2018-04-08 2018-07-27 李良杰 无人机协作系统
CN109116854A (zh) * 2018-09-16 2019-01-01 南京大学 一种基于强化学习的多组机器人协作控制方法及控制系统
CN210198395U (zh) * 2019-03-18 2020-03-27 东莞理工学院 无人机与无人车协作导航系统
CN110442133A (zh) * 2019-07-29 2019-11-12 南京市晨枭软件技术有限公司 一种多组工业机器人协同作业的方法及系统

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4120042A4

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023093734A1 (zh) * 2021-11-24 2023-06-01 华为技术有限公司 数据处理方法和通信装置

Also Published As

Publication number Publication date
US20230032176A1 (en) 2023-02-02
EP4120042A4 (en) 2023-07-26
CN113534784B (zh) 2024-03-05
EP4120042A1 (en) 2023-01-18
CN113534784A (zh) 2021-10-22

Similar Documents

Publication Publication Date Title
CN111931905B (zh) 一种图卷积神经网络模型、及利用该模型的车辆轨迹预测方法
US10163420B2 (en) System, apparatus and methods for adaptive data transport and optimization of application execution
US11940803B2 (en) Method, apparatus and computer storage medium for training trajectory planning model
Yang et al. Semantic communications with AI tasks
CN114519932B (zh) 一种基于时空关系抽取的区域交通状况集成预测方法
WO2021209023A1 (zh) 智能体动作的决策方法及相关设备
CN113422952B (zh) 基于时空传播层次编解码器的视频预测方法
Getu et al. Making sense of meaning: A survey on metrics for semantic and goal-oriented communication
Aparna et al. Steering angle prediction for autonomous driving using federated learning: The impact of vehicle-to-everything communication
Seo et al. Towards semantic communication protocols: A probabilistic logic perspective
Pan et al. Image segmentation semantic communication over internet of vehicles
Liu et al. Towards vehicle-to-everything autonomous driving: A survey on collaborative perception
CN113541986B (zh) 5g切片的故障预测方法、装置及计算设备
CN113255750A (zh) 一种基于深度学习的vcc车辆攻击检测方法
Liu et al. HPL-ViT: A Unified Perception Framework for Heterogeneous Parallel LiDARs in V2V
WO2023236601A1 (zh) 参数预测方法、预测服务器、预测系统及电子设备
CN115762147B (zh) 一种基于自适应图注意神经网络的交通流量预测方法
US20220391731A1 (en) Agent decision-making method and apparatus
CN114332699B (zh) 路况预测方法、装置、设备及存储介质
US20220388522A1 (en) Systems and methods for end-to-end learning of optimal driving policy
Yang et al. Semantic Change Driven Generative Semantic Communication Framework
CN114327935A (zh) 一种通信敏感的多智能体协同方法
WO2023116787A1 (zh) 智能模型的训练方法和装置
CN113420879A (zh) 多任务学习模型的预测方法及装置
CN113179484B (zh) 一种基于车联网模型的idnc网络编码方法

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21788348

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2021788348

Country of ref document: EP

Effective date: 20221014

NENP Non-entry into the national phase

Ref country code: DE