CN112183288B - Multi-agent reinforcement learning method based on model - Google Patents

Multi-agent reinforcement learning method based on model

Info

Publication number
CN112183288B
CN112183288B
Authority
CN
China
Prior art keywords
agent
model
environment
reinforcement learning
opponent
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011002376.8A
Other languages
Chinese (zh)
Other versions
CN112183288A (en)
Inventor
张伟楠
王锡淮
沈键
周铭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN202011002376.8A priority Critical patent/CN112183288B/en
Publication of CN112183288A publication Critical patent/CN112183288A/en
Application granted granted Critical
Publication of CN112183288B publication Critical patent/CN112183288B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/56Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G07CHECKING-DEVICES
    • G07CTIME OR ATTENDANCE REGISTERS; REGISTERING OR INDICATING THE WORKING OF MACHINES; GENERATING RANDOM NUMBERS; VOTING OR LOTTERY APPARATUS; ARRANGEMENTS, SYSTEMS OR APPARATUS FOR CHECKING NOT PROVIDED FOR ELSEWHERE
    • G07C5/00Registering or indicating the working of vehicles
    • G07C5/08Registering or indicating performance data other than driving, working, idle, or waiting time, with or without registering driving, working, idle or waiting time
    • G07C5/0808Diagnosing performance data

Abstract

The invention discloses a model-based multi-agent reinforcement learning method, which belongs to the field of multi-agent reinforcement learning and comprises the steps of modeling the multi-agent environment and the agents' policies, generating virtual trajectories for the multi-agents, and updating the multi-agents' policies with the virtual trajectories. In the invention, each agent makes decisions in a distributed manner, the multi-agent environment and the opponent agents' policies are modeled separately, and the resulting models are used to generate virtual trajectories. This effectively improves the sample efficiency of multi-agent reinforcement learning, reduces the number of agent interactions, lowers the risk of equipment damage, and improves the feasibility of deploying a distributed multi-agent reinforcement learning method in multi-agent tasks.

Description

Multi-agent reinforcement learning method based on model
Technical Field
The invention relates to the field of multi-agent reinforcement learning methods, in particular to a model-based multi-agent reinforcement learning method.
Background
Reinforcement learning is a sub-field of machine learning whose goal is to choose actions based on the environmental information received so as to maximize the expected return. Deep reinforcement learning uses neural networks to approximate the value function and the policy function, and has achieved above-human-average performance on many tasks. In a multi-agent scenario, every agent is learning and improving at the same time, which makes the environment non-stationary, and the relationship between agents may be competitive, cooperative, or somewhere in between. How and what information should be shared among agents is a further difficulty. Because of these problems introduced by the multi-agent setting, single-agent methods cannot be applied directly to multi-agent scenarios. As with single-agent algorithms, multi-agent reinforcement learning algorithms fall into two categories, model-free and model-based. Among them, model-free multi-agent reinforcement learning algorithms face a more serious sample-efficiency problem.
A model-based multi-agent reinforcement learning method aims to improve the sample efficiency of multi-agent reinforcement learning algorithms, that is, to reduce the number of interactions of the agents with the environment and the number of interactions between the agents. In general, reinforcement learning is currently sample-inefficient when applied to concrete applications. In multi-agent reinforcement learning applications, the joint action space and the joint state space of the agents reduce the sample efficiency even further. When multi-agent reinforcement learning is used to train a multi-vehicle autonomous driving scenario, several vehicles usually need massive training before they take reasonable actions in different situations, and during this training the vehicles continuously interact with the environment and with each other, so the probability of vehicle damage is high. Using a model-based approach helps to reduce the training cost.
(I) Analysis of recent patent technologies related to multi-agent reinforcement learning and model-based reinforcement learning:
1. The Chinese invention patent application No. 201811032979.5, "A path planning method based on multi-agent reinforcement learning", provides a multi-agent path planning method for the aircraft domain. It improves the survival rate and task completion rate of the aircraft by establishing a global state partition model of the aerial flight environment; it mainly uses an environment model for planning and considers the interaction among the agents;
2. The Chinese patent application No. 201911121271.1, "A cooperative agent learning method based on multi-agent reinforcement learning", provides a method in which agents share target parameters; it models the global environment, and the agents share the global model to improve the efficiency of the multi-agent algorithm.
(II) Analysis of recent research on model-based multi-agent reinforcement learning methods:
In "Multi-agent reinforcement learning with approximate model learning for competitive games", published in the journal PLoS One in 2019, modeling of the global environment is used as an auxiliary task to deepen the agents' understanding of the multi-agent environment. However, this work does not improve the sample efficiency of the algorithm.
The paper "Multi-agent reinforcement learning with multi-step generative models", published at the Conference on Robot Learning (CoRL) in 2019, uses a variational autoencoder to model the multi-agent environment and the opponent agents' policies, directly predicts a segment of trajectory, and then selects the optimal trajectory with model predictive control. This method effectively improves the sample efficiency, but the lack of a policy function increases the decision cost, and the centralized training and decision-making make the algorithm difficult to deploy in practical applications.
Therefore, those skilled in the art are working on developing a model-based method, Multi-Agent Branched-Rollout Policy Optimization, a multi-agent reinforcement learning method that can achieve higher sample efficiency in any environment.
Disclosure of Invention
In view of the above-mentioned drawbacks of the prior art, the technical problem to be solved by the present invention is how to reduce the number of interactions between an agent and the environment and between agents, while at the same time enabling distributed execution.
In order to achieve the above object, the present invention provides a model-based multi-agent reinforcement learning method, characterized in that, in a multi-agent environment, the multi-agent environment and the policies are modeled to generate virtual trajectories for the multi-agents, and the policies of the multi-agents are updated using the virtual trajectories.
Further, the multiple agents make decisions in a distributed manner.
Further, for the current agent i, denote the set of opponent agents as {-i}. The action of the current agent i depends on the joint policy π_{-i} of the opponent agents and the current state s_t. Let the joint action of the opponent agents at time t be a_t^{-i} = π_{-i}(s_t); the action of the current agent is then represented as a_t^i = π_i(s_t, a_t^{-i}), where π_i is the policy of the current agent.
Further, each of the multiple agents holds an independent multi-agent environment model P_i(s_{t+1}, r_t | s_t, a_i, a_{-i}) and a set of opponent policy models {π_j^i(a_j | s_t)}, j ∈ {-i}.
Further, a method of dynamically selecting an opponent model is used when generating the virtual trajectory.
Further, for current agent i, the model for each adversary strategy is represented as
Figure BDA0002694773740000025
Wherein j belongs to { -i }, the method for dynamically selecting the adversary model comprises two steps:
Step a: for each opponent policy model π_j^i, select a portion of the most recent real interaction data, compute the generalization error of the policy model, and denote it ε_j;
Step b, giving the length K of the virtual track, and then giving the length K of the virtual track to the opponent agent j, before
Figure BDA0002694773740000027
Using the model of the opponent agent's policy to generate an action of the opponent agent at that time; and following K-n j The adversary agent is requested in steps for actions taken under its true policy.
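For illustration only, the following Python sketch shows one possible realization of the dynamic opponent-model selection described above; the data structures (opponent_models, recent_real_data, real_opponents) and the rule mapping the error ε_j to the usage length n_j are assumptions introduced for the example, not part of the invention.

```python
import numpy as np

def generalization_errors(opponent_models, recent_real_data):
    """Step a: estimate the generalization error eps_j of each opponent policy model
    on a slice of the most recent real interaction data."""
    errors = {}
    for j, model in opponent_models.items():
        # recent_real_data[j]: list of (state, real_action) pairs observed recently
        errs = [np.mean((model.predict(s) - a) ** 2) for s, a in recent_real_data[j]]
        errors[j] = float(np.mean(errs))
    return errors

def branch_lengths(errors, K, scale=1.0):
    """Map each error eps_j to a usage length n_j <= K: the more accurate a model,
    the longer it is trusted inside a length-K virtual rollout (assumed rule)."""
    return {j: int(np.clip(scale / (eps + 1e-8), 0, K)) for j, eps in errors.items()}

def opponent_action(j, step, state, n_j, opponent_models, real_opponents):
    """Step b: the policy model supplies opponent j's action for the first n_j steps,
    and opponent j is queried for its real-policy action in the remaining K - n_j steps."""
    if step < n_j[j]:
        return opponent_models[j].predict(state)
    return real_opponents[j].act(state)
```

The intent of the split is that accurate opponent models are trusted for more of the rollout, while inaccurate ones fall back early to querying the real opponent, limiting compounding model error.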
Further, the generation of the virtual trajectory comprises the following steps:
Step 1: initialize t = 0; the length of the virtual trajectory is K;
Step 2: select a state s_t from a real trajectory;
Step 3: obtain the joint action a_{-i} of the other opponents in the state s_t;
Step 4: obtain the action of the current agent using the current agent's policy function, a_i = π_i(s_t, a_{-i});
Step 5: use the model of the multi-agent environment to predict the state s_{t+1} at the next moment and the reward r_t at the current moment;
Step 6: put (s_t, a_i, a_{-i}, s_{t+1}, r_t) into the experience replay pool D_model;
Step 7: let t = t + 1 and repeat the preceding steps until t > K.
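The branched-rollout structure of Steps 1-7 can be summarized in the Python sketch below; the buffer types, the env_model signature, and the joint_opponent_action helper are illustrative assumptions rather than the patented implementation.

```python
import random

def generate_virtual_trajectory(K, real_states, model_buffer,
                                pi_i, joint_opponent_action, env_model):
    """Steps 1-7: a length-K branched rollout that starts from a real state and
    continues under the learned multi-agent environment model."""
    s = random.choice(real_states)               # Step 2: start state from a real trajectory
    for t in range(K):                           # Steps 1 and 7: t = 0, 1, ..., K
        a_neg_i = joint_opponent_action(t, s)    # Step 3: joint opponent action in state s
        a_i = pi_i(s, a_neg_i)                   # Step 4: a_i = pi_i(s, a_-i)
        s_next, r = env_model(s, a_i, a_neg_i)   # Step 5: predicted next state and reward
        model_buffer.append((s, a_i, a_neg_i, s_next, r))  # Step 6: store in replay pool D_model
        s = s_next                               # continue the rollout from the predicted state
```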
Further, virtual trajectories are generated after the multi-agent environment and the opponent agents' policies have been modeled to a certain precision.
Further, when modeling the multi-agent environment and the opponent agents' policies, Gaussian distributions are used to represent the model outputs; several models are built for the multi-agent environment and are used together by an ensemble learning method. Let the number of environment models be B; the set of environment models is then {P_i^b(s_{t+1}, r_t | s_t, a_i, a_{-i})}, where b ∈ {1, …, B}, and the opponent policy model is π_j^i(a_j | s_t), where j ∈ {-i}.
Further, gradient descent is used for updating when modeling multi-agent environments and opponent agent policies.
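As a non-limiting illustration, one ensemble member with a Gaussian output head and the gradient-descent training loss might look as follows; the use of PyTorch, the network sizes, and the ensemble size B = 7 are assumptions made only for the sketch.

```python
import torch
import torch.nn as nn

class GaussianDynamicsModel(nn.Module):
    """One ensemble member: predicts mean and log-variance of a Gaussian over
    (next state, reward) given the state and the joint action of all agents."""
    def __init__(self, state_dim, joint_action_dim, hidden=200):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim + joint_action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mean = nn.Linear(hidden, state_dim + 1)     # next state + reward
        self.log_var = nn.Linear(hidden, state_dim + 1)

    def forward(self, state, joint_action):
        h = self.body(torch.cat([state, joint_action], dim=-1))
        return self.mean(h), self.log_var(h)

def gaussian_nll(mean, log_var, target):
    """Gaussian negative log-likelihood minimized by gradient descent for each member."""
    inv_var = torch.exp(-log_var)
    return ((mean - target) ** 2 * inv_var + log_var).mean()

# An ensemble of B models; during rollouts one member can be sampled per step.
B = 7
ensemble = [GaussianDynamicsModel(state_dim=32, joint_action_dim=8) for _ in range(B)]
```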
Further, when updating the policy of the current agent, the Soft Actor-Critic (SAC) algorithm is used.
The update formula for the critic Q function is:
J_Q = E_{(s_t, a_i, a_{-i}, s_{t+1}, r_t) ~ D} [ ( Q(s_t, a_i, a_{-i}) - ( r_t + γ V(s_{t+1}) ) )^2 ],
where V is the soft state-value function.
The update formula for the actor policy function is:
J_π = E_{s_t ~ D, e_t ~ N(0, I)} [ α log π_i( f(e_t; s_t) | s_t ) - Q( s_t, f(e_t; s_t), a_{-i} ) ],
where e_t is noise sampled from a Gaussian distribution and f is the reparameterization function.
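For reference, a minimal PyTorch-style sketch of these two SAC updates with the joint action (a_i, a_{-i}) fed to the Q function is given below; the sample() interface, the temperature α, the discount γ, and the reuse of the stored opponent action at the next state are simplifying assumptions, not the patented formulation.

```python
import torch

def critic_loss(Q, Q_target, pi_i, batch, gamma=0.99, alpha=0.2):
    """Soft Bellman residual for the joint-action Q function Q(s, a_i, a_-i)."""
    s, a_i, a_neg_i, s_next, r = batch
    with torch.no_grad():
        # Reparameterized action of the current agent at the next state;
        # the stored opponent action is reused as a simplification.
        a_i_next, log_p = pi_i.sample(s_next, a_neg_i)
        target = r + gamma * (Q_target(s_next, a_i_next, a_neg_i) - alpha * log_p)
    return ((Q(s, a_i, a_neg_i) - target) ** 2).mean()

def actor_loss(Q, pi_i, batch, alpha=0.2):
    """Policy objective: maximize the entropy-regularized Q value via the
    reparameterization trick, i.e. a_i = f(e_t; s) with e_t ~ N(0, I) inside sample()."""
    s, _, a_neg_i, _, _ = batch
    a_i, log_p = pi_i.sample(s, a_neg_i)
    return (alpha * log_p - Q(s, a_i, a_neg_i)).mean()
```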
As training proceeds, the multi-agent environment model and the opponent agent policy models used in the invention come closer to the real environment and the real policies, so the generated virtual trajectories become more and more realistic. The states the agent reaches along the generated virtual trajectories increasingly approximate those it could reach by interacting with real agents in the real situation, while the virtual trajectories can also explore states and interactions that real trajectories can hardly reach. The agent can therefore train effectively on the virtual trajectories, which reduces the chance of experiencing dangerous states and interactions in the real situation, lowers the risk of damage, and reduces the training cost. In this way, the multi-agent environment model and the opponent agent policy models allow the agent to be trained more comprehensively and more richly.
The invention has the following technical effects:
1. In the invention, each agent's decision can be made independently; optionally, the agents can communicate with each other to further improve performance.
2. The agent of the present invention is not limited to a specific action space, covering both discrete and continuous action spaces, and can therefore be combined with any reinforcement learning algorithm, such as DQN, A3C, or PPO.
3. The agent of the present invention is not limited to a particular state space and can therefore be combined with any modeling method.
The conception, specific structure and technical effects of the present invention will be further described in conjunction with the accompanying drawings to fully understand the purpose, characteristics and effects of the present invention.
Drawings
FIG. 1 is a diagram of a training framework for the method of the present invention;
FIG. 2 is a flow chart of the method of the present invention.
Detailed Description
A preferred embodiment of the present invention will be described below with reference to the accompanying drawings for clarity and understanding of the technical contents thereof. The present invention may be embodied in many different forms of embodiments and the scope of the invention is not limited to the embodiments set forth herein.
The embodiment of the invention provides a model-based multi-agent reinforcement learning method. The embodiments of the present invention apply the method in an environment where vehicles are automatically driven, where there are several vehicles, each with a different destination. The method comprises the following specific steps:
1. Define the observation space (i.e. the input space of the method) in the vehicle autonomous driving scenario, comprising the position of the vehicle in a high-definition spatial semantic map, the positions of other vehicles, pedestrians, and other individuals in the map, the planned driving trajectory, and the distance and direction of surrounding obstacles perceived by the sensors, etc. Define the action space of the vehicle as acceleration, steering, braking, etc. Define the external reward obtainable by the vehicle as determined by factors such as speed, route, collision, and comfort;
2. For each vehicle, randomly initialize a policy function π, a Q-function network, a multi-agent environment model, a set of policy models for the other vehicles, a real trajectory database D_env, and a virtual trajectory database D_model;
3. For each epoch:
(1) Each vehicle updates its multi-agent environment model, where the state s is composed of the observations of all vehicles; during training, each vehicle sends its own observation to the other vehicles.
4. For each time t:
(1) Each vehicle updates its models of the other vehicles' policies;
(2) Each vehicle makes decisions independently, using its models of the other vehicles' policies when deciding; the resulting real interaction data are added to the real trajectory database D_env;
(3) Each vehicle calculates the model errors {ε_j} of its models of the other vehicles' policies and, from these, the length n_j for which each model should be used when generating virtual trajectories;
(4) Each vehicle generates a virtual trajectory with its own multi-agent environment model, using the method of dynamically selecting the opponent model, and adds the virtual trajectory to the virtual trajectory database D_model. During dynamic selection of the opponent model, when vehicle i needs the action of vehicle j under its real policy in a state s_t: if the state was generated in the real environment, the observation of vehicle j is first computed from s_t; otherwise, the multi-agent environment model of vehicle i directly outputs the observation of vehicle j. After vehicle i obtains the observation o_j of vehicle j, it transmits o_j to vehicle j; vehicle j then makes the decision a_j under its real policy and transmits it back to vehicle i.
5. Each vehicle updates its policy function and Q-value function using the data in the real trajectory database and the virtual trajectory database, where the loss function of the Q-value function and the loss function of the policy function are the Soft Actor-Critic losses given above, e_t is the noise sampled from a Gaussian distribution, and f is the reparameterization function.
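Putting the pieces together, the per-vehicle training loop of steps 3-5 could be sketched as follows; the Vehicle interface and the helper functions from the earlier sketches are assumptions introduced only for illustration.

```python
def train(vehicles, epochs, T, K):
    """Per-vehicle training loop for the multi-vehicle embodiment (steps 3-5)."""
    for epoch in range(epochs):
        for v in vehicles:
            v.update_env_model()                       # step 3(1): fit the environment model on D_env
        for t in range(T):
            for v in vehicles:
                v.update_opponent_models()             # step 4(1): fit models of the other vehicles
                v.act_in_real_env()                    # step 4(2): real interaction, stored in D_env
                eps = generalization_errors(v.opponent_models, v.recent_real_data)
                n_j = branch_lengths(eps, K)           # step 4(3): usage length per opponent model
                generate_virtual_trajectory(           # step 4(4): branched rollout into D_model
                    K, v.real_states, v.D_model, v.policy,
                    lambda step, s: v.joint_opponent_action(step, s, n_j),
                    v.env_model)
                batch = v.sample_batch()               # step 5: mixed real + virtual data
                v.optimize(critic_loss(v.Q, v.Q_target, v.policy, batch),
                           actor_loss(v.Q, v.policy, batch))
```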
In the multi-vehicle autonomous driving scenario, the method improves the sample efficiency of the multi-agent reinforcement learning algorithm and reduces the number of real actions the vehicles must take during training. With a model-free multi-agent reinforcement learning algorithm alone, each vehicle would need a large amount of training in the real environment, with a correspondingly high risk of damage. Vehicles using the present method can interact virtually during training, which reduces the actions taken in the real environment and thus the risk, while exploring the state and action spaces more comprehensively, so that a better policy can be learned under safer conditions.
The foregoing detailed description of the preferred embodiments of the invention has been presented. It should be understood that numerous modifications and variations could be devised by those skilled in the art in light of the present teachings without departing from the inventive concepts. Therefore, the technical solutions that can be obtained by a person skilled in the art through logical analysis, reasoning or limited experiments based on the prior art according to the concepts of the present invention should be within the scope of protection determined by the claims.

Claims (8)

1. A model-based multi-agent reinforcement learning method, characterized in that, in a multi-agent environment, the multi-agent environment and the policies are modeled, virtual trajectories of the multi-agents are generated, the policies of the multi-agents are updated using the virtual trajectories, and the agents are vehicles;
when the virtual trajectories are generated, a method of dynamically selecting the opponent model is used;
for the current agent i, the model of each opponent's policy is represented as π_j^i, the opponents being the other vehicles, where j ∈ {-i}, and the method of dynamically selecting the opponent model comprises two steps:
Step a: for each opponent policy model π_j^i, select a portion of the most recent real interaction data, compute the generalization error of the policy model, and denote it ε_j, the real interaction data being generated when each vehicle independently makes a decision using its models of the other vehicles' policies;
Step b: given the length K of the virtual trajectory, for opponent agent j, use the model of the opponent agent's policy to generate the opponent agent's action in the first n_j steps, and in the following K - n_j steps request the opponent agent for the action taken under its real policy.
2. The model-based multi-agent reinforcement learning method of claim 1, wherein the multiple agents make distributed decisions.
3. The model-based multi-agent reinforcement learning method of claim 2, characterized in that, for the current agent i, the set of opponent agents is denoted {-i}, the action of the current agent i depends on the joint policy π_{-i} of the opponent agents and the current state s_t, the joint action of the opponent agents at time t is a_t^{-i} = π_{-i}(s_t), and the action of the current agent is represented as a_t^i = π_i(s_t, a_t^{-i}), where π_i is the policy of the current agent.
4. The model-based multi-agent reinforcement learning method of claim 3, wherein each of the multiple agents holds an independent multi-agent environment model P_i(s_{t+1}, r_t | s_t, a_i, a_{-i}) and a set of opponent policy models {π_j^i(a_j | s_t)}, j ∈ {-i}, wherein a_i is the action of the current agent and a_{-i} is the joint action of the other opponents in the state s.
5. The model-based multi-agent reinforcement learning method of claim 4, wherein the generation of the virtual trajectory comprises the following steps:
Step 1: initialize t = 0; the length of the virtual trajectory is K;
Step 2: select a state s_t from a real trajectory;
Step 3: obtain the joint action a_{-i} of the other opponents in the state s_t;
Step 4: obtain the action of the current agent using the current agent's policy function, a_i = π_i(s_t, a_{-i});
Step 5: use the model of the multi-agent environment to predict the state s_{t+1} at the next moment and the reward r_t at the current moment;
Step 6: put (s_t, a_i, a_{-i}, s_{t+1}, r_t) into the experience replay pool D_model;
Step 7: let t = t + 1 and repeat the preceding steps until t > K.
6. The model-based multi-agent reinforcement learning method of claim 5, characterized in that the virtual trajectory is generated after the multi-agent environment and the opponent agents' policies have been modeled to a set precision.
7. The model-based multi-agent reinforcement learning method of claim 6, characterized in that Gaussian distributions are used to represent the model outputs when modeling the multi-agent environment and the opponent agents' policies, several models are built for the multi-agent environment, and the multi-agent environment models are used with an ensemble learning method; let the number of environment models be B, the environment models then being {P_i^b(s_{t+1}, r_t | s_t, a_i, a_{-i})}, where b ∈ {1, …, B}; the opponent policy model is π_j^i(a_j | s_t), where j ∈ {-i}.
8. The model-based multi-agent reinforcement learning method of claim 7, wherein gradient descent is used for the updates when modeling the multi-agent environment and the opponent agents' policies.
CN202011002376.8A 2020-09-22 2020-09-22 Multi-agent reinforcement learning method based on model Active CN112183288B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011002376.8A CN112183288B (en) 2020-09-22 2020-09-22 Multi-agent reinforcement learning method based on model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011002376.8A CN112183288B (en) 2020-09-22 2020-09-22 Multi-agent reinforcement learning method based on model

Publications (2)

Publication Number Publication Date
CN112183288A CN112183288A (en) 2021-01-05
CN112183288B (en) 2022-10-21

Family

ID=73955716

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011002376.8A Active CN112183288B (en) 2020-09-22 2020-09-22 Multi-agent reinforcement learning method based on model

Country Status (1)

Country Link
CN (1) CN112183288B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113239629B (en) * 2021-06-03 2023-06-16 上海交通大学 Method for reinforcement learning exploration and utilization of trajectory space determinant point process
CN113599832B (en) * 2021-07-20 2023-05-16 北京大学 Opponent modeling method, device, equipment and storage medium based on environment model
CN116079747A (en) * 2023-03-29 2023-05-09 上海数字大脑科技研究院有限公司 Robot cross-body control method, system, computer equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110852448A (en) * 2019-11-15 2020-02-28 中山大学 Cooperative intelligent agent learning method based on multi-intelligent agent reinforcement learning
CN111330279A (en) * 2020-02-24 2020-06-26 网易(杭州)网络有限公司 Strategy decision model training method and device for game AI

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11586974B2 (en) * 2018-09-14 2023-02-21 Honda Motor Co., Ltd. System and method for multi-agent reinforcement learning in a multi-agent environment
CN110764507A (en) * 2019-11-07 2020-02-07 舒子宸 Artificial intelligence automatic driving system for reinforcement learning and information fusion
CN111324358B (en) * 2020-02-14 2020-10-16 南栖仙策(南京)科技有限公司 Training method for automatic operation and maintenance strategy of information system
CN111639809B (en) * 2020-05-29 2023-07-07 华中科技大学 Multi-agent evacuation simulation method and system based on leaders and panic emotion

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110852448A (en) * 2019-11-15 2020-02-28 中山大学 Cooperative intelligent agent learning method based on multi-intelligent agent reinforcement learning
CN111330279A (en) * 2020-02-24 2020-06-26 网易(杭州)网络有限公司 Strategy decision model training method and device for game AI

Also Published As

Publication number Publication date
CN112183288A (en) 2021-01-05

Similar Documents

Publication Publication Date Title
Sun et al. A fast integrated planning and control framework for autonomous driving via imitation learning
CN112183288B (en) Multi-agent reinforcement learning method based on model
CN113110592B (en) Unmanned aerial vehicle obstacle avoidance and path planning method
Chen et al. Stabilization approaches for reinforcement learning-based end-to-end autonomous driving
CN109726804B (en) Intelligent vehicle driving behavior personification decision-making method based on driving prediction field and BP neural network
CN111580544B (en) Unmanned aerial vehicle target tracking control method based on reinforcement learning PPO algorithm
Grigorescu et al. Neurotrajectory: A neuroevolutionary approach to local state trajectory learning for autonomous vehicles
Naveed et al. Trajectory planning for autonomous vehicles using hierarchical reinforcement learning
Sun et al. Crowd navigation in an unknown and dynamic environment based on deep reinforcement learning
Lubars et al. Combining reinforcement learning with model predictive control for on-ramp merging
CN114013443A (en) Automatic driving vehicle lane change decision control method based on hierarchical reinforcement learning
CN113867354A (en) Regional traffic flow guiding method for intelligent cooperation of automatic driving of multiple vehicles
CN113511222A (en) Scene self-adaptive vehicle interactive behavior decision and prediction method and device
CN116679719A (en) Unmanned vehicle self-adaptive path planning method based on dynamic window method and near-end strategy
CN113391633A (en) Urban environment-oriented mobile robot fusion path planning method
CN113507717A (en) Unmanned aerial vehicle track optimization method and system based on vehicle track prediction
Wang et al. Efficient reinforcement learning for autonomous driving with parameterized skills and priors
CN114973650A (en) Vehicle ramp entrance confluence control method, vehicle, electronic device, and storage medium
Regier et al. Improving navigation with the social force model by learning a neural network controller in pedestrian crowds
Liu et al. Cooperative Decision-Making for CAVs at Unsignalized Intersections: A MARL Approach with Attention and Hierarchical Game Priors
CN116227622A (en) Multi-agent landmark coverage method and system based on deep reinforcement learning
CN116620327A (en) Lane changing decision method for realizing automatic driving high-speed scene based on PPO and Lattice
CN113353102B (en) Unprotected left-turn driving control method based on deep reinforcement learning
Sun et al. Event-triggered reconfigurable reinforcement learning motion-planning approach for mobile robot in unknown dynamic environments
CN114386620A (en) Offline multi-agent reinforcement learning method based on action constraint

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant