CN113592100B - Multi-agent reinforcement learning method and system - Google Patents


Info

Publication number
CN113592100B
Authority
CN
China
Prior art keywords
value function
module
state action
agent
observation information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110863643.9A
Other languages
Chinese (zh)
Other versions
CN113592100A (en)
Inventor
李厚强 (Li Houqiang)
周文罡 (Zhou Wengang)
赵鉴 (Zhao Jian)
胡迅晗 (Hu Xunhan)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN202110863643.9A priority Critical patent/CN113592100B/en
Publication of CN113592100A publication Critical patent/CN113592100A/en
Application granted granted Critical
Publication of CN113592100B publication Critical patent/CN113592100B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a multi-agent reinforcement learning method and system, which design a multi-agent reinforcement learning framework consisting of a centralized teacher module and a decentralized student module and decouple the modules that solve the reward-distribution and local-observation problems while still satisfying the conditions of centralized training and decentralized execution, so that the training efficiency of the model is improved. Furthermore, the framework is generic and can be used in all methods proposed under the centralized-training, decentralized-execution paradigm. Experiments based on this scheme were carried out in StarCraft II, a mainstream cooperative multi-agent reinforcement learning environment, and the experimental results show that the scheme of the invention is superior to existing methods in both performance and training efficiency.

Description

Multi-agent reinforcement learning method and system
Technical Field
The invention relates to the technical field of multi-agent reinforcement learning, in particular to a multi-agent reinforcement learning method and system.
Background
In recent years, cooperative multi-agent reinforcement learning has developed rapidly and has been widely applied in real-world fields such as autonomous driving, computer games, sensor networks, and robot swarms.
One simple way to solve such problems is to convert the cooperative multi-agent problem into a single-agent reinforcement learning problem, taking the joint state/action space of all agents as the state/action space of one virtual agent. This is referred to as the centralized-execution cooperative multi-agent reinforcement learning approach. However, in this approach, the joint state-action space grows exponentially with the number of agents. Furthermore, in many real-world environments, centralized execution becomes impractical due to the partial observability of the agents and the limitations on inter-agent communication.
Another alternative is to train each agent as an independent individual, i.e., decentralized training. However, when only a team reward is available, it is difficult to design an effective individual reward function for each agent. In addition, decentralized training ignores the coordination and cooperation among agents in a multi-agent system. Compared with decentralized training, centralized training can access global information, removes the communication constraints among agents, and is beneficial for better distributing the team reward.
The existing mainstream multi-agent reinforcement learning training paradigm is centralized training with decentralized execution. In this paradigm, each agent's strategy is trained in a centralized manner using global information and executed in a decentralized manner based only on local information. This paradigm aims to solve two key problems: 1) how to effectively distribute the team reward to each agent by using global state information during centralized training; 2) how to adapt the learned knowledge to decentralized execution, which is conditioned only on local observations.
Disadvantage 1 of the prior art: the centralized-training, decentralized-execution paradigm uses one module to solve both problems, which increases the difficulty of model learning.
Disadvantage 2 of the prior art: in the centralized training and decentralized execution paradigm, the state action value function of a single agent depends only on local information, which makes it difficult for the paradigm to fully utilize global state information in the centralized training.
These two defects directly degrade the performance of the model, so that the model cannot complete the task of its application scene well. For example, in an autonomous driving scene the model cannot drive the vehicle accurately and effectively, affecting the safety of the passengers and of other vehicles on the road; in a computer game scene, automatic play becomes impossible, degrading the player's gaming experience.
Disclosure of Invention
The invention aims to provide a multi-agent reinforcement learning method and system that decouple the modules for solving the reward-distribution and local-observation problems, so that the training efficiency of the model is improved and the performance of the model can be improved.
The invention aims at realizing the following technical scheme:
a multi-agent reinforcement learning method, comprising:
setting a centralized teacher module and a decentralized student module, wherein the teacher module and the student module comprise the same number of value function networks, each value function network is a network for estimating a single-agent state action value function and corresponds to one agent in an application scene, and the teacher module is also provided with a mixing network for integrating the state action value functions of all single agents into a centralized state action value function;
in the training stage, each value function network in the teacher module inputs the global observation information at the current moment and the action obtained by the corresponding value function network at the last moment, and outputs a single-agent state action value function; all the single-agent state action value functions are input into the mixing network, and the mixing network performs weighted mixing of all the single-agent state action value functions to obtain a centralized state action value function, wherein the mixing network parameters express that the centralized state action value function depends more on certain single-agent state action value functions in the current state, namely the team reward depends more on the actions of certain agents, and team reward distribution is implicitly completed through gradient back-propagation during training; meanwhile, each value function network in the student module inputs the local observation information at the current moment and the action obtained at the last moment by the corresponding value function network in the teacher module, and a knowledge distillation mechanism is adopted so that the state action value function estimated by the teacher module guides the student module to learn the local state action value function; the global observation information can represent all states in the whole application scene, and the local observation information represents a part of the states in the whole application scene;
in the execution stage, each value function network in the trained student module inputs the local observation information of its corresponding agent at the current moment, outputs the state action value function of the corresponding single agent, and the action with the maximum state action value function is selected for execution.
A multi-agent reinforcement learning system, the system comprising: a centralized teacher module and a decentralized student module; the teacher module and the student module comprise the same number of value function networks, wherein each value function network is a network for estimating a single-agent state action value function and corresponds to one agent in an application scene; the teacher module is also provided with a mixing network that integrates the state action value functions of all single agents into a centralized state action value function;
in the training stage, each value function network in the teacher module inputs the global observation information at the current moment and the action obtained by the corresponding value function network at the last moment, and outputs a single-agent state action value function; all the single-agent state action value functions are input into the mixing network, and the mixing network performs weighted mixing of all the single-agent state action value functions to obtain a centralized state action value function, wherein the mixing network parameters express that the centralized state action value function depends more on certain single-agent state action value functions in the current state, namely the team reward depends more on the actions of certain agents, and team reward distribution is implicitly completed through gradient back-propagation during training; meanwhile, each value function network in the student module inputs the local observation information at the current moment and the action obtained at the last moment by the corresponding value function network in the teacher module, and a knowledge distillation mechanism is adopted so that the state action value function estimated by the teacher module guides the student module to learn the local state action value function; the global observation information can represent all states in the whole application scene, and the local observation information represents a part of the states in the whole application scene;
in the execution stage, each value function network in the trained student module outputs the state action value function of its single agent according to the local observation information of the same-numbered agent at the current moment, and the action with the maximum state action value function is selected for execution.
According to the technical scheme provided by the invention, a multi-agent reinforcement learning framework consisting of a centralized teacher module and a decentralized student module is designed, and the modules for solving the reward-distribution and local-observation problems are decoupled while still satisfying the conditions of centralized training and decentralized execution, so that the training efficiency of the model is improved, the performance of the model is improved, and the task of the scene is executed effectively; for example, in the aforementioned application scenes this can improve the safety of the passengers and of other vehicles on the road, improve the player's gaming experience, and so on. Furthermore, the framework is generic and can be used in all methods proposed under the centralized-training, decentralized-execution paradigm.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic diagram of a multi-agent reinforcement learning method according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to fall within the scope of the invention.
The terms that may be used herein will first be described as follows:
the terms "comprises," "comprising," "includes," "including," "has," "having" or other similar referents are to be construed to cover a non-exclusive inclusion. For example: including a particular feature (e.g., a starting material, component, ingredient, carrier, formulation, material, dimension, part, means, mechanism, apparatus, step, procedure, method, reaction condition, processing condition, parameter, algorithm, signal, data, product or article of manufacture, etc.), should be construed as including not only a particular feature but also other features known in the art that are not explicitly recited.
The multi-agent reinforcement learning method provided by the invention is described in detail below. Matters not described in detail in the embodiments of the present invention belong to the prior art known to those skilled in the art. Where specific conditions are not noted in the examples of the present invention, they are carried out according to conditions conventional in the art or suggested by the manufacturer. The method is mainly based on the following principle: the idea of knowledge distillation is introduced, and a multi-agent reinforcement learning framework based on a centralized teacher and decentralized students is designed. The whole training framework is divided into the following three parts: 1) the teacher module distributes the team reward through centralized training according to the global information; 2) the student module approximates, using local information, the single-agent state action value function estimated by the teacher module; 3) through a knowledge distillation mechanism, the information learned by the teacher module is used to guide the learning of the student module.
Fig. 1 shows the main framework of the multi-agent reinforcement learning method, which mainly includes a centralized teacher module (left side of Fig. 1) and a decentralized student module (right side of Fig. 1). The teacher module and the student module contain the same number of value function networks (for example, both contain N value function networks in Fig. 1); each value function network is a network for estimating a single-agent state action value function and corresponds to one agent in the application scene, and the arrows indicate the data flow direction during training/execution.
1. Teacher module.
The teacher module is provided with a plurality of parameter-shared value function networks (namely the multi-layer perceptron, gated recurrent unit and multi-layer perceptron parts on the left side of Fig. 1), and is also provided with a mixing network for integrating the state action value functions of all single agents into a centralized state action value function.
In the embodiment of the invention, the teacher module only participates in centralized training and does not take part in decentralized execution.
As shown in Fig. 1, the observation information input at the current moment by each value function network in the teacher module is global observation information and can represent all states in the whole application scene; that is, the observation information input by each value function network in the teacher module contains all the observation information in the application scene. For example, in a computer game scene, the observation information input by each value function network in the teacher module is the observation information obtained when the field of view is infinite (because the field of view is infinite, the observation information of all own and enemy units can be obtained). The observation information input at the current moment by each value function network in the teacher module is the same, and is called the centralized observation information. The centralized observation information, together with the action a obtained at the last moment (obtained by the value function network in the teacher module), is provided to the corresponding value function network in the teacher module, which outputs the state action value function for the current moment, in which the superscript indicates the number of the agent, the subscript t indicates the moment, and τ indicates the action-state history of the agent. The single-agent state action value functions of all agents are input into the mixing network, and the centralized state action value function is output through the two layers of neural networks in the mixing network. The mixing network is optimized using the temporal-difference loss of the output centralized state action value function as a constraint. The mixing network performs weighted mixing of the single-agent state action value functions to obtain the centralized state action value function; the mixing network parameters express that, in the current state, the centralized state action value function depends more on certain single-agent state action value functions, namely the team reward depends more on the actions of certain agents, and team reward distribution is implicitly completed through gradient back-propagation during network training. A task target is preset according to the application scene, and the temporal-difference loss can be obtained using this task target, so that the teacher module is trained. For example, in a computer game scene, the target is for one's own side to win; during training, if one's own side wins, a positive team reward is obtained, otherwise a negative team reward is obtained, and this reward is used to derive the temporal-difference loss.
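To make this concrete, the following is a minimal PyTorch-style sketch, not the patent's exact implementation: a parameter-shared per-agent value network with the MLP-GRU-MLP shape of Fig. 1, and a two-layer mixing network whose state-dependent, non-negative weights combine the per-agent values into a centralized value. The layer sizes and the QMIX-style hypernetwork weighting are illustrative assumptions.

# Hypothetical sketch of the teacher module: a parameter-shared per-agent value
# network (MLP -> GRU -> MLP, left half of Fig. 1) and a two-layer mixing
# network that weights the per-agent values into a centralized value.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AgentQNetwork(nn.Module):
    """MLP -> GRU -> MLP; outputs a Q-value for every action of one agent."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.fc_in = nn.Linear(obs_dim + n_actions, hidden)   # observation + one-hot last action
        self.gru = nn.GRUCell(hidden, hidden)                  # keeps the action-state history
        self.fc_out = nn.Linear(hidden, n_actions)

    def forward(self, obs, last_action_onehot, h):
        x = F.relu(self.fc_in(torch.cat([obs, last_action_onehot], dim=-1)))
        h = self.gru(x, h)
        return self.fc_out(h), h                               # per-action Q-values, new hidden state

class MixingNetwork(nn.Module):
    """Two-layer mixer: weights the N single-agent Q-values into one centralized Q-value."""
    def __init__(self, n_agents, state_dim, embed=32):
        super().__init__()
        # Hypernetworks produce state-dependent mixing weights; abs() keeps them
        # non-negative so the centralized value grows with each agent's value.
        self.w1, self.b1 = nn.Linear(state_dim, n_agents * embed), nn.Linear(state_dim, embed)
        self.w2, self.b2 = nn.Linear(state_dim, embed), nn.Linear(state_dim, 1)

    def forward(self, agent_qs, state):                        # agent_qs: (B, N), state: (B, state_dim)
        B, N = agent_qs.shape
        hidden = F.elu(torch.bmm(agent_qs.unsqueeze(1), torch.abs(self.w1(state)).view(B, N, -1))
                       + self.b1(state).view(B, 1, -1))
        q_tot = torch.bmm(hidden, torch.abs(self.w2(state)).view(B, -1, 1)) + self.b2(state).view(B, 1, 1)
        return q_tot.view(B)                                   # centralized state action value

Because the mixing weights depend on the centralized observation and are non-negative, increasing any single agent's value can only increase the centralized value, which is what lets the gradient of the team-level loss implicitly allocate the team reward among agents.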
2. Student module.
The student module is provided with a plurality of value function networks (namely the multi-layer perceptron, gated recurrent unit and multi-layer perceptron parts on the right side of Fig. 1). The value function networks in the student module do not share parameters with the value function networks in the teacher module.
In order to realize decentralized execution, and unlike the teacher module, the input of each value function network in the student module is the observation information of a part of the application scene at the current moment; that is, the input of the value function networks in the student module is the decentralized observation information (that is, the original observation information), so the student module focuses more on local information than the teacher module does.
As shown in the dashed box at the lower right corner of Fig. 1, the student module has two inputs when working. One is the observation information of each value function network at the current moment; as described above, the observation information of each value function network is local information and can only represent a part of the states in the application scene at the current moment. The other is the action decided at the last moment by each value function network of the teacher module. As will be appreciated by those skilled in the art, when a value function network makes an action decision, the action with the largest value is typically selected. When the student module works, the actions it requires can come from the application scene or can be passed directly from the teacher module to the student module; in the example shown in Fig. 1, the actions required by the working student module come from the application scene, and in this example these actions are also provided to the correspondingly numbered value function networks in the teacher module (as indicated by the bottom arrow in Fig. 1).
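As an illustration only, reusing the AgentQNetwork class and imports from the teacher sketch above, the student module could be sketched as one independently parameterized value network per agent, fed with that agent's local observation and the action chosen at the last moment by the same-numbered teacher network; the agent count and dimensions below are assumptions.

# Hypothetical sketch of the student module: one independently parameterized
# value network per agent (parameters NOT shared with the teacher), fed with the
# agent's local observation and the teacher's last-step action.
n_agents, local_obs_dim, n_actions = 3, 30, 9                    # illustrative sizes
student_nets = [AgentQNetwork(local_obs_dim, n_actions) for _ in range(n_agents)]
student_hidden = [torch.zeros(1, 64) for _ in range(n_agents)]   # one GRU state per agent

def student_q_values(local_obs, teacher_last_actions):
    """local_obs: list of (1, local_obs_dim) tensors; teacher_last_actions: list of action ids."""
    qs = []
    for i in range(n_agents):
        a_onehot = F.one_hot(torch.tensor([teacher_last_actions[i]]), n_actions).float()
        q_i, student_hidden[i] = student_nets[i](local_obs[i], a_onehot, student_hidden[i])
        qs.append(q_i)                                           # (1, n_actions) local Q estimate of agent i
    return qs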
3. Knowledge distillation mechanism.
The teacher module makes full use of the global information to distribute the team reward reasonably, while the student module focuses on learning the optimal individual local state action value function so as to realize decentralized execution. Therefore, a knowledge distillation mechanism is adopted so that the perfect state action value function estimated by the teacher module guides the student module to learn the local state action value function; during each training step, the input of the value function networks in the student module is the observation information at the current moment and the action at the last moment. In the embodiment of the invention, a mean squared error loss is adopted to minimize the difference between the state action value functions estimated by the teacher module and the student module.
Experiments show that the local value function estimated by the student module is an unbiased estimate of the perfect state action value function predicted by the teacher network.
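A minimal sketch of this distillation step, under the assumption that both modules evaluate the action actually taken: the student's Q-value for that action is regressed onto the teacher's detached estimate with a mean squared error, so gradients flow only into the student.

# Hypothetical sketch of the knowledge-distillation loss: the student's Q-value
# for the action actually taken is pulled toward the teacher's detached estimate
# with a mean squared error, so no gradient flows back into the teacher.
def distillation_loss(teacher_qs, student_qs, actions):
    """teacher_qs, student_qs: (B, N, n_actions); actions: (B, N) chosen action ids."""
    idx = actions.unsqueeze(-1)
    q_teacher = teacher_qs.gather(-1, idx).squeeze(-1).detach()  # teacher is the fixed target
    q_student = student_qs.gather(-1, idx).squeeze(-1)
    return F.mse_loss(q_student, q_teacher)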
4. Training and execution phases.
1) In the training stage, each value function network of the teacher module inputs the centralized observation information at the current moment and the action at the last moment, and outputs the single-agent state action value function at the current moment; all the single-agent state action value functions are input into the mixing network, and the mixing network performs weighted mixing of all the single-agent state action value functions to obtain the centralized state action value function, wherein the mixing network parameters express that the centralized state action value function depends more on certain single-agent state action value functions in the current state, namely the team reward depends more on the actions of certain agents, and team reward distribution is implicitly completed through gradient back-propagation during training. Meanwhile, a knowledge distillation mechanism is adopted so that the state action value function estimated by the teacher module guides the student module to learn the local state action value function. As shown in Fig. 1, the value function networks in the teacher module and the student module correspond one-to-one by number, so the state action value function estimated by a value function network in the teacher module guides the same-numbered value function network in the student module to learn the local state action value function.
The parameters of the teacher module and the student module are updated simultaneously through iterative interaction: the teacher module is optimized using the temporal-difference loss as a constraint, and the parameters of the student module are learned using the mean squared error loss between the state action value functions estimated by the student module and the teacher module as a constraint.
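The joint update could look like the following sketch, building on the components above. The helpers compute_teacher_qs and compute_student_qs, the batch layout, and the target-network handling are hypothetical placeholders rather than part of the patent; the point is only that the temporal-difference loss on the centralized value and the distillation loss on the students are stepped in the same iteration.

# Hypothetical sketch of one joint update.  compute_teacher_qs / compute_student_qs
# and the batch layout are placeholder helpers (assumed, not from the patent);
# target-network synchronization and replay sampling are omitted for brevity.
global_obs_dim = 60                                              # illustrative size
teacher_net = AgentQNetwork(global_obs_dim, n_actions)           # parameter-shared across agents
mixer = MixingNetwork(n_agents, global_obs_dim)
target_mixer = MixingNetwork(n_agents, global_obs_dim)           # periodically synced target (sync omitted)
params = list(teacher_net.parameters()) + list(mixer.parameters()) \
         + [p for net in student_nets for p in net.parameters()]
opt = torch.optim.Adam(params, lr=5e-4)

def training_step(batch, gamma=0.99):
    teacher_qs = compute_teacher_qs(teacher_net, batch)          # (B, N, n_actions), placeholder helper
    chosen = teacher_qs.gather(-1, batch["actions"].unsqueeze(-1)).squeeze(-1)
    q_tot = mixer(chosen, batch["global_obs"])                   # centralized state action value

    with torch.no_grad():                                        # TD target built from the team reward
        next_qs = compute_teacher_qs(teacher_net, batch, next_step=True).max(dim=-1).values
        target = batch["reward"] + gamma * (1 - batch["done"]) * target_mixer(next_qs, batch["next_global_obs"])
    td_loss = F.mse_loss(q_tot, target)                          # trains teacher networks and mixer

    student_qs = compute_student_qs(student_nets, batch)         # (B, N, n_actions), placeholder helper
    kd_loss = distillation_loss(teacher_qs, student_qs, batch["actions"])

    opt.zero_grad()
    (td_loss + kd_loss).backward()                               # one backward pass updates both modules
    opt.step()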
2) In the execution stage, the relevant tasks are executed using the trained student module.
As can be seen from the above description, the reinforcement learning scheme provided by the invention satisfies the conditions of centralized training and decentralized execution.
The scheme provided by the embodiment of the invention is applied to multi-agent scenes such as traffic vehicle control and computer games. Whichever scene it is applied to, each value function network (namely one agent in the application scene) is used to acquire pre-assigned information of a specified type; this information is observation information and represents states in the current application scene. For example, when applied to traffic vehicle control, each agent is assigned in advance, each agent controls one vehicle, and each agent acquires the current information of that vehicle, including speed information, position information, distances to the front and rear vehicles, and so on, all of which are observation information. When applied to a computer game scene, each agent is likewise assigned in advance, each agent controls one of one's own game units, and each agent acquires the related information in the game scene, including the position and health of its own game unit and the positions and health of own and enemy units within its field of view; all of this information is observation information.
It should be noted that the value function networks in the teacher module input global observation information; for example, when applied to traffic vehicle control, they input the global information of all vehicles (including speed information, position information, distances to the front and rear vehicles, and so on). In contrast, the information input to the value function networks in the student module is local observation information and can only reflect a local state of the whole application scene. Here, local and global are relative concepts: global means that all the information available in the scene can be obtained, while local means that only a part of the information in the scene (such as one vehicle, or one of one's own game units) can be obtained. A sketch of this distinction is given after the next paragraph.
In the training phase, the input of each value function network in the student module is the local observation information at the current moment (for example, in a computer game scene, each agent acquires its own position and health and the positions and health of own and enemy units within its field of view at the current moment). Each value function network in the teacher module inputs the global observation information at the current moment (such as the positions and health of all of one's own game units and of the own and enemy units within all fields of view), combines the action of the same-numbered value function network at the last moment (obtained by the value function network in the teacher module) with the global observation information at the current moment to make an action decision, and outputs the single-agent state action value function. All the single-agent state action value functions are processed by the mixing network to obtain the centralized state action value function, so that the teacher module is trained. The agents of the teacher module are trained, and then, through the knowledge distillation mechanism, each value function network of the teacher module guides the training of the same-numbered value function network in the student module.
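For illustration, global versus local observations could be assembled as in the following sketch for the traffic-vehicle-control scene (the dictionary field names and the flat-vector encoding are assumptions, and the torch import comes from the teacher sketch above): the teacher's networks all receive the concatenated quantities of every vehicle, while each student network only receives the quantities of the vehicle it controls.

# Hypothetical sketch of global vs. local observations for traffic vehicle control
# (dictionary field names and the flat encoding are assumptions).
def global_observation(vehicles):
    feats = []
    for v in vehicles:                                           # every vehicle is visible to the teacher
        feats += [v["speed"], v["position"], v["gap_front"], v["gap_rear"]]
    return torch.tensor(feats).unsqueeze(0)                      # identical centralized input for all teacher networks

def local_observation(vehicles, agent_id):
    v = vehicles[agent_id]                                       # a student only sees its own vehicle
    return torch.tensor([v["speed"], v["position"], v["gap_front"], v["gap_rear"]]).unsqueeze(0)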
In the execution stage, each value function network in the trained student module outputs the state action value function of its single agent according to the local observation information of the same-numbered agent at the current moment, and the action with the maximum state action value function is selected for execution.
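A sketch of this greedy decentralized execution, reusing the student networks defined above; here each agent conditions on its own previously executed action, an assumption consistent with the actions coming from the application scene.

# Hypothetical sketch of decentralized execution with the trained student networks:
# each agent greedily picks the action with the largest local state action value.
def execute_step(local_obs, last_actions):
    actions = []
    for i in range(n_agents):
        a_onehot = F.one_hot(torch.tensor([last_actions[i]]), n_actions).float()
        q_i, student_hidden[i] = student_nets[i](local_obs[i], a_onehot, student_hidden[i])
        actions.append(int(q_i.argmax(dim=-1)))                  # greedy action for agent i
    return actions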
Taking the game scene of StarCraft II as an example, one's own side comprises a plurality of combat units, and each combat unit corresponds to one agent, which acquires the observation information related to the current moment (such as the positions and health introduced above) and then determines the corresponding action (such as moving or attacking); each action corresponds to a state action value, and the action corresponding to the maximum state action value is selected for execution. If one's own units defeat all enemy units, the game is won; otherwise it is lost.
Based on this scheme, experiments were carried out in StarCraft II, a mainstream cooperative multi-agent reinforcement learning environment, and the experimental results show that the scheme of the invention outperforms existing methods in both performance and training efficiency.
Another embodiment of the present invention further provides a multi-agent reinforcement learning system. Referring to Fig. 1, it mainly includes: a centralized teacher module and a decentralized student module; the teacher module and the student module comprise the same number of value function networks, wherein each value function network is a network for estimating a single-agent state action value function and corresponds to one agent in an application scene; the teacher module is also provided with a mixing network that integrates the state action value functions of all single agents into a centralized state action value function;
in the training stage, the observation information input by each value function network in the teacher module is the global observation information at the current moment, which is combined with the action at the last moment to output the single-agent state action value function at the current moment; all the single-agent state action value functions are input into the mixing network, and the mixing network performs weighted mixing of all the single-agent state action value functions to obtain the centralized state action value function, wherein the mixing network parameters express that the centralized state action value function depends more on certain single-agent state action value functions in the current state, namely the team reward depends more on the actions of certain agents, and team reward distribution is implicitly completed through gradient back-propagation during training; meanwhile, each value function network in the student module inputs the local observation information at the current moment and the action obtained at the last moment by the corresponding value function network of the teacher module, and a knowledge distillation mechanism is adopted so that the state action value function estimated by the teacher module guides the student module to learn the local state action value function; the observation information is information representing the states of the current application scene; the application scene comprises: traffic vehicle control and computer game scenes;
in the execution stage, each value function network in the trained student module outputs the state action value function of its single agent according to the local observation information of the same-numbered agent at the current moment, and the action with the maximum state action value function is selected for execution.
From the description of the above embodiments, it will be apparent to those skilled in the art that the above embodiments may be implemented in software, or may be implemented by means of software plus a necessary general hardware platform. With such understanding, the technical solutions of the foregoing embodiments may be embodied in a software product, where the software product may be stored in a nonvolatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.), and include several instructions for causing a computer device (may be a personal computer, a server, or a network device, etc.) to perform the methods of the embodiments of the present invention.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional modules is illustrated, and in practical application, the above-described functional allocation may be performed by different functional modules according to needs, i.e. the internal structure of the system is divided into different functional modules to perform all or part of the functions described above.
The foregoing is only a preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions easily contemplated by those skilled in the art within the scope of the present invention should be included in the scope of the present invention. Therefore, the protection scope of the present invention should be subject to the protection scope of the claims.

Claims (3)

1. A multi-agent reinforcement learning method, comprising:
setting a centralized teacher module and a decentralized student module, wherein the teacher module and the student module comprise the same number of value function networks, each value function network is a network for estimating a single-agent state action value function and corresponds to one agent in an application scene, and the teacher module is also provided with a mixing network for integrating the state action value functions of all single agents into a centralized state action value function;
in the training stage, each value function network in the teacher module inputs the global observation information at the current moment and the action obtained by the corresponding value function network at the last moment, and outputs a single-agent state action value function; all the single-agent state action value functions are input into the mixing network, and the mixing network performs weighted mixing of all the single-agent state action value functions to obtain a centralized state action value function, wherein the mixing network parameters express that the centralized state action value function depends more on certain single-agent state action value functions in the current state, namely the team reward depends more on the actions of certain agents, and team reward distribution is implicitly completed through gradient back-propagation during training; meanwhile, each value function network in the student module inputs the local observation information at the current moment and the action obtained at the last moment by the corresponding value function network in the teacher module, and a knowledge distillation mechanism is adopted so that the state action value function estimated by the teacher module guides the student module to learn the local state action value function; the global observation information can represent all states in the whole application scene, and the local observation information represents a part of the states in the whole application scene;
in the execution stage, each value function network in the trained student module inputs the local observation information of its corresponding agent at the current moment, outputs the state action value function of the corresponding single agent, and the action with the maximum state action value function is selected for execution;
the application scene comprises: traffic vehicle control and computer game scenes; when applied to traffic vehicle control, the observation information at least comprises: speed information, position information, and distances to the front and rear vehicles; when applied to a computer game scene, the observation information comprises several of the following: the positions and health of one's own game units in the game scene, and the positions and health of own and enemy units within the field of view;
in the teacher module, the global observation information input to the value function networks is acquired from the system, and the global observation information acquired by all value function networks in the teacher module is the same; for a traffic vehicle control scene, each value function network is used to control one vehicle, and the global observation information comprises: the speed information, position information, and distances to the front and rear vehicles of all vehicles; for a computer game scene, each value function network is used to control one of one's own game units, and the global observation information comprises several of the following: the positions and health of all of one's own game units in the game scene, and the positions and health of own and enemy units within all fields of view;
all value function networks in the student module are assigned in advance, and each value function network only acquires its assigned observation information, which represents the states of a part of the application scene; for a traffic vehicle control scene, each value function network is used to control one vehicle, and the local observation information comprises: the speed information, position information, and distances to the front and rear vehicles of the vehicle controlled by that value function network; for a computer game scene, each value function network is used to control one of one's own game units, and the local observation information comprises several of the following: the position and health of the controlled game unit in the game scene, and the positions and health of own and enemy units within its field of view.
2. The multi-agent reinforcement learning method of claim 1, wherein in the training stage, the parameters of the teacher module and the student module are updated simultaneously through iterative interaction, the mixing network in the teacher module is optimized using the temporal-difference loss of the output centralized state action value function as a constraint, and the parameters of the student module are learned using the mean squared error loss between the state action value functions estimated by the student module and the teacher module as a constraint.
3. A multi-agent reinforcement learning system, the system comprising: a centralized teacher module and a decentralized student module; the teacher module and the student module comprise the same number of value function networks, wherein each value function network is a network for estimating a single-agent state action value function and corresponds to one agent in an application scene; the teacher module is also provided with a mixing network that integrates the state action value functions of all single agents into a centralized state action value function;
in the training stage, each value function network in the teacher module inputs the global observation information at the current moment and the action obtained by the corresponding value function network at the last moment, and outputs a single-agent state action value function; all the single-agent state action value functions are input into the mixing network, and the mixing network performs weighted mixing of all the single-agent state action value functions to obtain a centralized state action value function, wherein the mixing network parameters express that the centralized state action value function depends more on certain single-agent state action value functions in the current state, namely the team reward depends more on the actions of certain agents, and team reward distribution is implicitly completed through gradient back-propagation during training; meanwhile, each value function network in the student module inputs the local observation information at the current moment and the action obtained at the last moment by the corresponding value function network in the teacher module, and a knowledge distillation mechanism is adopted so that the state action value function estimated by the teacher module guides the student module to learn the local state action value function; the global observation information can represent all states in the whole application scene, and the local observation information represents a part of the states in the whole application scene;
in the execution stage, each value function network in the trained student module outputs the state action value function of its single agent according to the local observation information of the same-numbered agent at the current moment, and the action with the maximum state action value function is selected for execution;
the application scene comprises: traffic vehicle control and computer game scenes; when applied to traffic vehicle control, the observation information at least comprises: speed information, position information, and distances to the front and rear vehicles; when applied to a computer game scene, the observation information comprises several of the following: the positions and health of one's own game units in the game scene, and the positions and health of own and enemy units within the field of view;
in the teacher module, the global observation information input to the value function networks is acquired from the system, and the global observation information acquired by all value function networks in the teacher module is the same; for a traffic vehicle control scene, each value function network is used to control one vehicle, and the global observation information comprises: the speed information, position information, and distances to the front and rear vehicles of all vehicles; for a computer game scene, each value function network is used to control one of one's own game units, and the global observation information comprises several of the following: the positions and health of all of one's own game units in the game scene, and the positions and health of own and enemy units within all fields of view;
all value function networks in the student module are assigned in advance, and each value function network only acquires its assigned observation information, which represents the states of a part of the application scene; for a traffic vehicle control scene, each value function network is used to control one vehicle, and the local observation information comprises: the speed information, position information, and distances to the front and rear vehicles of the vehicle controlled by that value function network; for a computer game scene, each value function network is used to control one of one's own game units, and the local observation information comprises several of the following: the position and health of the controlled game unit in the game scene, and the positions and health of own and enemy units within its field of view.
CN202110863643.9A 2021-07-29 2021-07-29 Multi-agent reinforcement learning method and system Active CN113592100B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110863643.9A CN113592100B (en) 2021-07-29 2021-07-29 Multi-agent reinforcement learning method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110863643.9A CN113592100B (en) 2021-07-29 2021-07-29 Multi-agent reinforcement learning method and system

Publications (2)

Publication Number Publication Date
CN113592100A CN113592100A (en) 2021-11-02
CN113592100B true CN113592100B (en) 2024-02-23

Family

ID=78251817

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110863643.9A Active CN113592100B (en) 2021-07-29 2021-07-29 Multi-agent reinforcement learning method and system

Country Status (1)

Country Link
CN (1) CN113592100B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112364984A (en) * 2020-11-13 2021-02-12 南京航空航天大学 Cooperative multi-agent reinforcement learning method
CN112541060A (en) * 2020-11-19 2021-03-23 中国科学院深圳先进技术研究院 End-to-end task type dialogue learning framework and method based on confrontation training

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11195093B2 (en) * 2017-05-18 2021-12-07 Samsung Electronics Co., Ltd Apparatus and method for student-teacher transfer learning network using knowledge bridge
US11651208B2 (en) * 2017-05-19 2023-05-16 Deepmind Technologies Limited Training action selection neural networks using a differentiable credit function

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112364984A (en) * 2020-11-13 2021-02-12 南京航空航天大学 Cooperative multi-agent reinforcement learning method
CN112541060A (en) * 2020-11-19 2021-03-23 中国科学院深圳先进技术研究院 End-to-end task type dialogue learning framework and method based on confrontation training

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Traffic sign classification based on enhanced supervised knowledge distillation; Zhao Shengwei; Ge Shiming; Ye Qiting; Luo Chao; Li Qiang; China Sciencepaper (20); full text *
Multi-agent cooperation based on the MADDPG algorithm under sparse rewards; Xu Nuo; Yang Zhenwei; Modern Computer (15); full text *

Also Published As

Publication number Publication date
CN113592100A (en) 2021-11-02

Similar Documents

Publication Publication Date Title
CN112364984A (en) Cooperative multi-agent reinforcement learning method
Loiacono et al. The 2009 simulated car racing championship
CN112215350B (en) Method and device for controlling agent based on reinforcement learning
CN111580544B (en) Unmanned aerial vehicle target tracking control method based on reinforcement learning PPO algorithm
CN112183288B (en) Multi-agent reinforcement learning method based on model
CN109726804B (en) Intelligent vehicle driving behavior personification decision-making method based on driving prediction field and BP neural network
CN110861634A (en) Interaction aware decision making
CN114013443B (en) Automatic driving vehicle lane change decision control method based on hierarchical reinforcement learning
CN113511222B (en) Scene self-adaptive vehicle interaction behavior decision and prediction method and device
Lubars et al. Combining reinforcement learning with model predictive control for on-ramp merging
CN113867354A (en) Regional traffic flow guiding method for intelligent cooperation of automatic driving of multiple vehicles
Wang et al. High-level decision making for automated highway driving via behavior cloning
CN116405904A (en) TACS network resource allocation method based on deep reinforcement learning
CN115933717A (en) Unmanned aerial vehicle intelligent air combat maneuver decision training system and method based on deep reinforcement learning
CN114037048B (en) Belief-consistent multi-agent reinforcement learning method based on variational circulation network model
CN113592100B (en) Multi-agent reinforcement learning method and system
Wei et al. Deep hierarchical reinforcement learning based formation planning for multiple unmanned surface vehicles with experimental results
Bécsi et al. Highway environment model for reinforcement learning
Garzón et al. Game theoretic decision making based on real sensor data for autonomous vehicles’ maneuvers in high traffic
CN116245156A (en) Reinforced learning and application method, system, equipment and medium of multi-agent scene
CN116227622A (en) Multi-agent landmark coverage method and system based on deep reinforcement learning
Goto et al. Solving the deadlock problem with deep reinforcement learning using information from multiple vehicles
Cardamone et al. Transfer of driving behaviors across different racing games
CN114895710A (en) Control method and system for autonomous behavior of unmanned aerial vehicle cluster
Liu et al. Multi-agent collaborative adaptive cruise control based on reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant