CN112183288B - Multi-agent reinforcement learning method based on model - Google Patents

Multi-agent reinforcement learning method based on model

Info

Publication number
CN112183288B
CN112183288B
Authority
CN
China
Prior art keywords
agent
model
environment
reinforcement learning
opponent
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011002376.8A
Other languages
Chinese (zh)
Other versions
CN112183288A (en)
Inventor
张伟楠
王锡淮
沈键
周铭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN202011002376.8A priority Critical patent/CN112183288B/en
Publication of CN112183288A publication Critical patent/CN112183288A/en
Application granted granted Critical
Publication of CN112183288B publication Critical patent/CN112183288B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/56Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G07CHECKING-DEVICES
    • G07CTIME OR ATTENDANCE REGISTERS; REGISTERING OR INDICATING THE WORKING OF MACHINES; GENERATING RANDOM NUMBERS; VOTING OR LOTTERY APPARATUS; ARRANGEMENTS, SYSTEMS OR APPARATUS FOR CHECKING NOT PROVIDED FOR ELSEWHERE
    • G07C5/00Registering or indicating the working of vehicles
    • G07C5/08Registering or indicating performance data other than driving, working, idle, or waiting time, with or without registering driving, working, idle or waiting time
    • G07C5/0808Diagnosing performance data

Abstract

The invention discloses a model-based multi-agent reinforcement learning method, which belongs to the field of multi-agent reinforcement learning and comprises the steps of modeling the multi-agent environment and the agents' policies, generating virtual trajectories for the multi-agents, and updating the multi-agents' policies with the virtual trajectories. In the invention, each agent makes decisions in a distributed manner, the multi-agent environment and the opponent agents' policies are modeled separately, and the resulting models are used to generate virtual trajectories. This effectively improves the sample efficiency of multi-agent reinforcement learning, reduces the number of agent interactions, lowers the risk of equipment damage, and improves the feasibility of deploying a distributed multi-agent reinforcement learning method in multi-agent tasks.

Description

Multi-agent reinforcement learning method based on model
Technical Field
The invention relates to the field of multi-agent reinforcement learning methods, in particular to a model-based multi-agent reinforcement learning method.
Background
Reinforcement learning is a sub-field of machine learning whose goal is to choose actions based on the environmental information received so as to maximize the expected return. Deep reinforcement learning uses neural networks to approximate the value function and the policy function, and has achieved above-human-average performance on many tasks. In a multi-agent scenario, every agent is learning and improving at the same time, which makes the environment non-stationary, and the relationship between agents may be competitive, cooperative, or somewhere in between. How and what information should be shared among agents is a further difficulty. Because of these problems introduced by the multi-agent setting, single-agent methods cannot be applied directly to multi-agent scenarios. As with single-agent algorithms, multi-agent reinforcement learning algorithms fall into two categories, model-free and model-based. Among them, model-free multi-agent reinforcement learning algorithms face a more serious sample-efficiency problem.
A model-based multi-agent reinforcement learning method aims to improve the sample efficiency of multi-agent reinforcement learning algorithms, that is, to reduce the number of interactions of the agents with the environment and the number of interactions between the agents. In general, reinforcement learning is currently sample-inefficient when applied to concrete applications. In multi-agent reinforcement learning applications, the joint action space and the joint state space of the agents reduce the sample efficiency even further. When multi-agent reinforcement learning is used to train a multi-vehicle autonomous driving scenario, several vehicles usually need massive training before they take reasonable actions in different situations, and during this training the vehicles continuously interact with the environment and with each other, so the probability of vehicle damage is high. Using a model-based approach helps to reduce the training cost.
(I) Analysis of recent patent technologies related to multi-agent reinforcement learning and model-based reinforcement learning:
1. The Chinese invention patent application No. 201811032979.5, "A path planning method based on multi-agent reinforcement learning", provides a multi-agent path planning method for the aircraft domain. It improves the survival rate and task completion rate of the aircraft by establishing a global state partition model of the aerial flight environment; it mainly uses an environment model for planning and considers the interaction among the agents;
2. The Chinese patent application No. 201911121271.1, "A cooperative agent learning method based on multi-agent reinforcement learning", provides a method in which agents share target parameters; it models the global environment, and the agents share the global model to improve the efficiency of the multi-agent algorithm.
(II) Analysis of recent research on model-based multi-agent reinforcement learning methods:
In "Multi-agent reinforcement learning with approximate model learning for competitive games", published in the journal PLoS One in 2019, modeling of the global environment is used as an auxiliary task to deepen the agents' understanding of the multi-agent environment. However, this work does not improve the sample efficiency of the algorithm.
The paper "Multi-agent reinforcement learning with multi-step generative models", published at the Conference on Robot Learning (CoRL) in 2019, uses a variational autoencoder to model the multi-agent environment and the opponent agents' policies, directly predicts a segment of trajectory, and then selects the optimal trajectory with model predictive control. This method effectively improves the sample efficiency, but the lack of a policy function increases the decision cost, and the centralized training and decision-making make the algorithm difficult to deploy in practical applications.
Therefore, those skilled in the art are working on developing a model-based method, Multi-Agent Branched-Rollout Policy Optimization, a multi-agent reinforcement learning method that can achieve higher sample efficiency in any environment.
Disclosure of Invention
In view of the above-mentioned drawbacks of the prior art, the technical problem to be solved by the present invention is how to reduce the number of interactions between an agent and the environment and between agents, while at the same time enabling distributed execution.
In order to achieve the above object, the present invention provides a model-based multi-agent reinforcement learning method, characterized in that, in a multi-agent environment, the multi-agent environment and the policies are modeled to generate virtual trajectories for the multi-agents, and the policies of the multi-agents are updated using the virtual trajectories.
Further, the multiple agents make decisions in a distributed manner.
Further, for the current agent i, denote the set of opponent agents as {-i}. The action of the current agent i depends on the joint policy π_{-i} of the opponent agents and the current state s_t. Let the joint action of the opponent agents at time t be a_t^{-i} = π_{-i}(s_t); the action of the current agent is then represented as a_t^i = π_i(s_t, a_t^{-i}), where π_i is the policy of the current agent.
Further, each of the multiple agents holds an independent multi-agent environment model P_i(s_{t+1}, r_t | s_t, a_i, a_{-i}) and a set of opponent policy models {π_j^i(a_j | s_t)}, j ∈ {-i}.
Further, a method of dynamically selecting an opponent model is used when generating the virtual trajectory.
Further, for current agent i, the model for each adversary strategy is represented as
Figure BDA0002694773740000025
Wherein j belongs to { -i }, the method for dynamically selecting the adversary model comprises two steps:
Step a: for each opponent policy model π_j^i, select a portion of the most recent real interaction data, compute the generalization error of the policy model, and denote it ε_j;
Step b, giving the length K of the virtual track, and then giving the length K of the virtual track to the opponent agent j, before
Figure BDA0002694773740000027
Using the model of the opponent agent's policy to generate an action of the opponent agent at that time; and following K-n j The adversary agent is requested in steps for actions taken under its true policy.
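For illustration only, the following Python sketch shows one possible realization of the dynamic opponent-model selection described above; the data structures (opponent_models, recent_real_data, real_opponents) and the rule mapping the error ε_j to the usage length n_j are assumptions introduced for the example, not part of the invention.

```python
import numpy as np

def generalization_errors(opponent_models, recent_real_data):
    """Step a: estimate the generalization error eps_j of each opponent policy model
    on a slice of the most recent real interaction data."""
    errors = {}
    for j, model in opponent_models.items():
        # recent_real_data[j]: list of (state, real_action) pairs observed recently
        errs = [np.mean((model.predict(s) - a) ** 2) for s, a in recent_real_data[j]]
        errors[j] = float(np.mean(errs))
    return errors

def branch_lengths(errors, K, scale=1.0):
    """Map each error eps_j to a usage length n_j <= K: the more accurate a model,
    the longer it is trusted inside a length-K virtual rollout (assumed rule)."""
    return {j: int(np.clip(scale / (eps + 1e-8), 0, K)) for j, eps in errors.items()}

def opponent_action(j, step, state, n_j, opponent_models, real_opponents):
    """Step b: the policy model supplies opponent j's action for the first n_j steps,
    and opponent j is queried for its real-policy action in the remaining K - n_j steps."""
    if step < n_j[j]:
        return opponent_models[j].predict(state)
    return real_opponents[j].act(state)
```

The intent of the split is that accurate opponent models are trusted for more of the rollout, while inaccurate ones fall back early to querying the real opponent, limiting compounding model error.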
Further, the generation of the virtual trajectory comprises the following steps:
Step 1: initialize t = 0; the length of the virtual trajectory is K;
Step 2: select a state s_t from a real trajectory;
Step 3: obtain the joint action a_{-i} of the other opponents in the state s_t;
Step 4: obtain the action of the current agent using the current agent's policy function, a_i = π_i(s_t, a_{-i});
Step 5: use the model of the multi-agent environment to predict the state s_{t+1} at the next moment and the reward r_t at the current moment;
Step 6: put (s_t, a_i, a_{-i}, s_{t+1}, r_t) into the experience replay pool D_model;
Step 7: let t = t + 1 and repeat the preceding steps until t > K.
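The branched-rollout structure of Steps 1-7 can be summarized in the Python sketch below; the buffer types, the env_model signature, and the joint_opponent_action helper are illustrative assumptions rather than the patented implementation.

```python
import random

def generate_virtual_trajectory(K, real_states, model_buffer,
                                pi_i, joint_opponent_action, env_model):
    """Steps 1-7: a length-K branched rollout that starts from a real state and
    continues under the learned multi-agent environment model."""
    s = random.choice(real_states)               # Step 2: start state from a real trajectory
    for t in range(K):                           # Steps 1 and 7: t = 0, 1, ..., K
        a_neg_i = joint_opponent_action(t, s)    # Step 3: joint opponent action in state s
        a_i = pi_i(s, a_neg_i)                   # Step 4: a_i = pi_i(s, a_-i)
        s_next, r = env_model(s, a_i, a_neg_i)   # Step 5: predicted next state and reward
        model_buffer.append((s, a_i, a_neg_i, s_next, r))  # Step 6: store in replay pool D_model
        s = s_next                               # continue the rollout from the predicted state
```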
Further, virtual trajectories are generated after the multi-agent environment and the opponent agents' policies have been modeled to a certain precision.
Further, when modeling the multi-agent environment and the opponent agents' policies, Gaussian distributions are used to represent the model outputs; several models are built for the multi-agent environment and are used together by an ensemble learning method. Let the number of environment models be B; the set of environment models is then {P_i^b(s_{t+1}, r_t | s_t, a_i, a_{-i})}, where b ∈ {1, …, B}, and the opponent policy model is π_j^i(a_j | s_t), where j ∈ {-i}.
Further, gradient descent is used for updating when modeling multi-agent environments and opponent agent policies.
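As a non-limiting illustration, one ensemble member with a Gaussian output head and the gradient-descent training loss might look as follows; the use of PyTorch, the network sizes, and the ensemble size B = 7 are assumptions made only for the sketch.

```python
import torch
import torch.nn as nn

class GaussianDynamicsModel(nn.Module):
    """One ensemble member: predicts mean and log-variance of a Gaussian over
    (next state, reward) given the state and the joint action of all agents."""
    def __init__(self, state_dim, joint_action_dim, hidden=200):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim + joint_action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mean = nn.Linear(hidden, state_dim + 1)     # next state + reward
        self.log_var = nn.Linear(hidden, state_dim + 1)

    def forward(self, state, joint_action):
        h = self.body(torch.cat([state, joint_action], dim=-1))
        return self.mean(h), self.log_var(h)

def gaussian_nll(mean, log_var, target):
    """Gaussian negative log-likelihood minimized by gradient descent for each member."""
    inv_var = torch.exp(-log_var)
    return ((mean - target) ** 2 * inv_var + log_var).mean()

# An ensemble of B models; during rollouts one member can be sampled per step.
B = 7
ensemble = [GaussianDynamicsModel(state_dim=32, joint_action_dim=8) for _ in range(B)]
```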
Further, when updating the policy of the current agent, the Soft Actor-Critic (SAC) algorithm is used.
The update formula for the critic Q function is:
J_Q = E_{(s_t, a_i, a_{-i}, s_{t+1}, r_t) ~ D} [ ( Q(s_t, a_i, a_{-i}) - ( r_t + γ V(s_{t+1}) ) )^2 ],
where V is the soft state-value function.
The update formula for the actor policy function is:
J_π = E_{s_t ~ D, e_t ~ N(0, I)} [ α log π_i( f(e_t; s_t) | s_t ) - Q( s_t, f(e_t; s_t), a_{-i} ) ],
where e_t is noise sampled from a Gaussian distribution and f is the reparameterization function.
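For reference, a minimal PyTorch-style sketch of these two SAC updates with the joint action (a_i, a_{-i}) fed to the Q function is given below; the sample() interface, the temperature α, the discount γ, and the reuse of the stored opponent action at the next state are simplifying assumptions, not the patented formulation.

```python
import torch

def critic_loss(Q, Q_target, pi_i, batch, gamma=0.99, alpha=0.2):
    """Soft Bellman residual for the joint-action Q function Q(s, a_i, a_-i)."""
    s, a_i, a_neg_i, s_next, r = batch
    with torch.no_grad():
        # Reparameterized action of the current agent at the next state;
        # the stored opponent action is reused as a simplification.
        a_i_next, log_p = pi_i.sample(s_next, a_neg_i)
        target = r + gamma * (Q_target(s_next, a_i_next, a_neg_i) - alpha * log_p)
    return ((Q(s, a_i, a_neg_i) - target) ** 2).mean()

def actor_loss(Q, pi_i, batch, alpha=0.2):
    """Policy objective: maximize the entropy-regularized Q value via the
    reparameterization trick, i.e. a_i = f(e_t; s) with e_t ~ N(0, I) inside sample()."""
    s, _, a_neg_i, _, _ = batch
    a_i, log_p = pi_i.sample(s, a_neg_i)
    return (alpha * log_p - Q(s, a_i, a_neg_i)).mean()
```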
As training proceeds, the multi-agent environment model and the opponent agent policy models used in the invention come closer to the real environment and the real policies, so the generated virtual trajectories become more and more realistic. The states the agent reaches along the generated virtual trajectories increasingly approximate those it could reach by interacting with real agents in the real situation, while the virtual trajectories can also explore states and interactions that real trajectories can hardly reach. The agent can therefore train effectively on the virtual trajectories, which reduces the chance of experiencing dangerous states and interactions in the real situation, lowers the risk of damage, and reduces the training cost. In this way, the multi-agent environment model and the opponent agent policy models allow the agent to be trained more comprehensively and more richly.
The invention has the following technical effects:
1. In the invention, each agent's decision can be made independently; optionally, the agents can communicate with each other to further improve performance.
2. The agent of the present invention is not limited to a specific action space, covering both discrete and continuous action spaces, and can therefore be combined with any reinforcement learning algorithm, such as DQN, A3C, or PPO.
3. The agent of the present invention is not limited to a particular state space and can therefore be combined with any modeling method.
The conception, specific structure and technical effects of the present invention will be further described in conjunction with the accompanying drawings to fully understand the purpose, characteristics and effects of the present invention.
Drawings
FIG. 1 is a diagram of a training framework for the method of the present invention;
FIG. 2 is a flow chart of the method of the present invention.
Detailed Description
A preferred embodiment of the present invention will be described below with reference to the accompanying drawings for clarity and understanding of the technical contents thereof. The present invention may be embodied in many different forms of embodiments and the scope of the invention is not limited to the embodiments set forth herein.
The embodiment of the invention provides a model-based multi-agent reinforcement learning method. The embodiments of the present invention apply the method in an environment where vehicles are automatically driven, where there are several vehicles, each with a different destination. The method comprises the following specific steps:
1. Define the observation space (i.e. the input space of the method) in the vehicle autonomous driving scenario, comprising the position of the vehicle in a high-definition spatial semantic map, the positions of other vehicles, pedestrians, and other individuals in the map, the planned driving trajectory, and the distance and direction of surrounding obstacles perceived by the sensors, etc. Define the action space of the vehicle as acceleration, steering, braking, etc. Define the external reward obtainable by the vehicle as determined by factors such as speed, route, collision, and comfort;
2. For each vehicle, randomly initialize a policy function π, a Q-function network, a multi-agent environment model, a set of policy models for the other vehicles, a real trajectory database D_env, and a virtual trajectory database D_model;
3. For each epoch:
(1) Each vehicle updates its multi-agent environment model, where the state s is composed of the observations of all vehicles; during training, each vehicle sends its own observation to the other vehicles.
4. For each time t:
(1) Each vehicle updates its models of the other vehicles' policies;
(2) Each vehicle makes decisions independently, using its models of the other vehicles' policies when deciding; the resulting real interaction data are added to the real trajectory database D_env;
(3) Each vehicle calculates the model errors {ε_j} of its models of the other vehicles' policies and, from these, the length n_j for which each model should be used when generating virtual trajectories;
(4) Each vehicle generates a virtual trajectory with its own multi-agent environment model, using the method of dynamically selecting the opponent model, and adds the virtual trajectory to the virtual trajectory database D_model. During dynamic selection of the opponent model, when vehicle i needs the action of vehicle j under its real policy in a state s_t: if the state was generated in the real environment, the observation of vehicle j is first computed from s_t; otherwise, the multi-agent environment model of vehicle i directly outputs the observation of vehicle j. After vehicle i obtains the observation o_j of vehicle j, it transmits o_j to vehicle j; vehicle j then makes the decision a_j under its real policy and transmits it back to vehicle i.
5. Each vehicle updates its policy function and Q-value function using the data in the real trajectory database and the virtual trajectory database, where the loss function of the Q-value function and the loss function of the policy function are the Soft Actor-Critic losses given above, e_t is the noise sampled from a Gaussian distribution, and f is the reparameterization function.
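Putting the pieces together, the per-vehicle training loop of steps 3-5 could be sketched as follows; the Vehicle interface and the helper functions from the earlier sketches are assumptions introduced only for illustration.

```python
def train(vehicles, epochs, T, K):
    """Per-vehicle training loop for the multi-vehicle embodiment (steps 3-5)."""
    for epoch in range(epochs):
        for v in vehicles:
            v.update_env_model()                       # step 3(1): fit the environment model on D_env
        for t in range(T):
            for v in vehicles:
                v.update_opponent_models()             # step 4(1): fit models of the other vehicles
                v.act_in_real_env()                    # step 4(2): real interaction, stored in D_env
                eps = generalization_errors(v.opponent_models, v.recent_real_data)
                n_j = branch_lengths(eps, K)           # step 4(3): usage length per opponent model
                generate_virtual_trajectory(           # step 4(4): branched rollout into D_model
                    K, v.real_states, v.D_model, v.policy,
                    lambda step, s: v.joint_opponent_action(step, s, n_j),
                    v.env_model)
                batch = v.sample_batch()               # step 5: mixed real + virtual data
                v.optimize(critic_loss(v.Q, v.Q_target, v.policy, batch),
                           actor_loss(v.Q, v.policy, batch))
```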
In the multi-vehicle autonomous driving scenario, the method improves the sample efficiency of the multi-agent reinforcement learning algorithm and reduces the number of real actions the vehicles must take during training. With a model-free multi-agent reinforcement learning algorithm alone, each vehicle would need a large amount of training in the real environment, with a correspondingly high risk of damage. Vehicles using the present method can interact virtually during training, which reduces the actions taken in the real environment and thus the risk, while exploring the state and action spaces more comprehensively, so that a better policy can be learned under safer conditions.
The foregoing detailed description of the preferred embodiments of the invention has been presented. It should be understood that numerous modifications and variations could be devised by those skilled in the art in light of the present teachings without departing from the inventive concepts. Therefore, the technical solutions that can be obtained by a person skilled in the art through logical analysis, reasoning or limited experiments based on the prior art according to the concepts of the present invention should be within the scope of protection determined by the claims.

Claims (8)

1. A model-based multi-agent reinforcement learning method, characterized in that, in a multi-agent environment, the multi-agent environment and the policies are modeled, virtual trajectories of the multi-agents are generated, the policies of the multi-agents are updated using the virtual trajectories, and the agents are vehicles;
when the virtual trajectories are generated, a method of dynamically selecting the opponent model is used;
for the current agent i, the model of each opponent's policy is represented as π_j^i, the opponents being the other vehicles, where j ∈ {-i}, and the method of dynamically selecting the opponent model comprises two steps:
Step a: for each opponent policy model π_j^i, select a portion of the most recent real interaction data, compute the generalization error of the policy model, and denote it ε_j, the real interaction data being generated when each vehicle independently makes a decision using its models of the other vehicles' policies;
Step b: given the length K of the virtual trajectory, for opponent agent j, use the model of the opponent agent's policy to generate the opponent agent's action in the first n_j steps, and in the following K - n_j steps request the opponent agent for the action taken under its real policy.
2. The model-based multi-agent reinforcement learning method of claim 1, wherein the multiple agents make distributed decisions.
3. The model-based multi-agent reinforcement learning method of claim 2, characterized in that, for the current agent i, the set of opponent agents is denoted {-i}, the action of the current agent i depends on the joint policy π_{-i} of the opponent agents and the current state s_t, the joint action of the opponent agents at time t is a_t^{-i} = π_{-i}(s_t), and the action of the current agent is represented as a_t^i = π_i(s_t, a_t^{-i}), where π_i is the policy of the current agent.
4. The model-based multi-agent reinforcement learning method of claim 3, wherein each of the multiple agents holds an independent multi-agent environment model P_i(s_{t+1}, r_t | s_t, a_i, a_{-i}) and a set of opponent policy models {π_j^i(a_j | s_t)}, j ∈ {-i}, wherein a_i is the action of the current agent and a_{-i} is the joint action of the other opponents in the state s.
5. The model-based multi-agent reinforcement learning method of claim 4, wherein the generation of the virtual trajectory comprises the following steps:
Step 1: initialize t = 0; the length of the virtual trajectory is K;
Step 2: select a state s_t from a real trajectory;
Step 3: obtain the joint action a_{-i} of the other opponents in the state s_t;
Step 4: obtain the action of the current agent using the current agent's policy function, a_i = π_i(s_t, a_{-i});
Step 5: use the model of the multi-agent environment to predict the state s_{t+1} at the next moment and the reward r_t at the current moment;
Step 6: put (s_t, a_i, a_{-i}, s_{t+1}, r_t) into the experience replay pool D_model;
Step 7: let t = t + 1 and repeat the preceding steps until t > K.
6. The model-based multi-agent reinforcement learning method of claim 5, characterized in that the virtual trajectory is generated after the multi-agent environment and the opponent agents' policies have been modeled to a set precision.
7. The model-based multi-agent reinforcement learning method of claim 6, characterized in that Gaussian distributions are used to represent the model outputs when modeling the multi-agent environment and the opponent agents' policies, several models are built for the multi-agent environment, and the multi-agent environment models are used with an ensemble learning method; let the number of environment models be B, the environment models then being {P_i^b(s_{t+1}, r_t | s_t, a_i, a_{-i})}, where b ∈ {1, …, B}; the opponent policy model is π_j^i(a_j | s_t), where j ∈ {-i}.
8. The model-based multi-agent reinforcement learning method of claim 7, wherein gradient descent is used for the updates when modeling the multi-agent environment and the opponent agents' policies.
CN202011002376.8A 2020-09-22 2020-09-22 Multi-agent reinforcement learning method based on model Active CN112183288B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011002376.8A CN112183288B (en) 2020-09-22 2020-09-22 Multi-agent reinforcement learning method based on model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011002376.8A CN112183288B (en) 2020-09-22 2020-09-22 Multi-agent reinforcement learning method based on model

Publications (2)

Publication Number Publication Date
CN112183288A CN112183288A (en) 2021-01-05
CN112183288B (en) 2022-10-21

Family

ID=73955716

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011002376.8A Active CN112183288B (en) 2020-09-22 2020-09-22 Multi-agent reinforcement learning method based on model

Country Status (1)

Country Link
CN (1) CN112183288B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113239629B (en) * 2021-06-03 2023-06-16 上海交通大学 Method for reinforcement learning exploration and utilization of trajectory space determinant point process
CN113599832B (en) * 2021-07-20 2023-05-16 北京大学 Opponent modeling method, device, equipment and storage medium based on environment model
CN116079747A (en) * 2023-03-29 2023-05-09 上海数字大脑科技研究院有限公司 Robot cross-body control method, system, computer equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110852448A (en) * 2019-11-15 2020-02-28 中山大学 Cooperative intelligent agent learning method based on multi-intelligent agent reinforcement learning
CN111330279A (en) * 2020-02-24 2020-06-26 网易(杭州)网络有限公司 Strategy decision model training method and device for game AI

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11586974B2 (en) * 2018-09-14 2023-02-21 Honda Motor Co., Ltd. System and method for multi-agent reinforcement learning in a multi-agent environment
CN110764507A (en) * 2019-11-07 2020-02-07 舒子宸 Artificial intelligence automatic driving system for reinforcement learning and information fusion
CN111324358B (en) * 2020-02-14 2020-10-16 南栖仙策(南京)科技有限公司 Training method for automatic operation and maintenance strategy of information system
CN111639809B (en) * 2020-05-29 2023-07-07 华中科技大学 Multi-agent evacuation simulation method and system based on leaders and panic emotion

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110852448A (en) * 2019-11-15 2020-02-28 中山大学 Cooperative intelligent agent learning method based on multi-intelligent agent reinforcement learning
CN111330279A (en) * 2020-02-24 2020-06-26 网易(杭州)网络有限公司 Strategy decision model training method and device for game AI

Also Published As

Publication number Publication date
CN112183288A (en) 2021-01-05

Similar Documents

Publication Publication Date Title
Sun et al. A fast integrated planning and control framework for autonomous driving via imitation learning
CN112183288B (en) Multi-agent reinforcement learning method based on model
CN113110592B (en) Unmanned aerial vehicle obstacle avoidance and path planning method
Chen et al. Stabilization approaches for reinforcement learning-based end-to-end autonomous driving
CN109726804B (en) Intelligent vehicle driving behavior personification decision-making method based on driving prediction field and BP neural network
CN111580544B (en) Unmanned aerial vehicle target tracking control method based on reinforcement learning PPO algorithm
Grigorescu et al. Neurotrajectory: A neuroevolutionary approach to local state trajectory learning for autonomous vehicles
Naveed et al. Trajectory planning for autonomous vehicles using hierarchical reinforcement learning
Sun et al. Crowd navigation in an unknown and dynamic environment based on deep reinforcement learning
Lubars et al. Combining reinforcement learning with model predictive control for on-ramp merging
CN114013443A (en) Automatic driving vehicle lane change decision control method based on hierarchical reinforcement learning
CN113867354A (en) Regional traffic flow guiding method for intelligent cooperation of automatic driving of multiple vehicles
CN113511222A (en) Scene self-adaptive vehicle interactive behavior decision and prediction method and device
CN116679719A (en) Unmanned vehicle self-adaptive path planning method based on dynamic window method and near-end strategy
CN113391633A (en) Urban environment-oriented mobile robot fusion path planning method
CN113507717A (en) Unmanned aerial vehicle track optimization method and system based on vehicle track prediction
Wang et al. Efficient reinforcement learning for autonomous driving with parameterized skills and priors
CN114973650A (en) Vehicle ramp entrance confluence control method, vehicle, electronic device, and storage medium
Regier et al. Improving navigation with the social force model by learning a neural network controller in pedestrian crowds
Liu et al. Cooperative Decision-Making for CAVs at Unsignalized Intersections: A MARL Approach with Attention and Hierarchical Game Priors
CN116227622A (en) Multi-agent landmark coverage method and system based on deep reinforcement learning
CN116620327A (en) Lane changing decision method for realizing automatic driving high-speed scene based on PPO and Lattice
CN113353102B (en) Unprotected left-turn driving control method based on deep reinforcement learning
Sun et al. Event-triggered reconfigurable reinforcement learning motion-planning approach for mobile robot in unknown dynamic environments
CN114386620A (en) Offline multi-agent reinforcement learning method based on action constraint

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant