CN113705778A - Air multi-agent training method and device - Google Patents

Air multi-agent training method and device

Info

Publication number
CN113705778A
CN113705778A
Authority
CN
China
Prior art keywords
agent
training
decision model
action execution
bias factor
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110904682.9A
Other languages
Chinese (zh)
Inventor
彭宣淇
朴海音
孙智孝
韩玥
杨晟琦
樊松源
孙阳
于津
田明俊
金琳乘
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenyang Aircraft Design and Research Institute Aviation Industry of China AVIC
Original Assignee
Shenyang Aircraft Design and Research Institute Aviation Industry of China AVIC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenyang Aircraft Design and Research Institute Aviation Industry of China AVIC
Priority to CN202110904682.9A
Publication of CN113705778A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05D SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00 Control of position, course or altitude of land, water, air, or space vehicles, e.g. automatic pilot
    • G05D1/12 Target-seeking control
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Abstract

The application discloses an air multi-agent training method and device. The air multi-agent training method comprises the following steps: acquiring a multi-agent decision model, wherein the multi-agent decision model comprises action execution step length probability distribution information and a bias factor; and training the multi-agent decision model, wherein, during training, the selection of an action execution step length from the action execution step length probability distribution information is influenced by the bias factor. By adding the bias factor, the agent can autonomously decide how long each action is executed, the difficulty of exploring the environment and obtaining reward values is reduced exponentially, and the agent learns a macroscopic strategy more easily.

Description

Air multi-agent training method and device
Technical Field
The application relates to the technical field of unmanned aerial vehicles, in particular to an aerial multi-agent training method and an aerial multi-agent training device.
Background
Multi-agent reinforcement learning is currently a research focus in the field of reinforcement learning; it mainly addresses the problem of policy evolution for multiple agents in a shared environment and closely matches real-world decision processes. When reinforcement learning is applied to practical control problems, the large solution space and sparse rewards are major factors limiting the agent's learning efficiency.
The prior art addresses these problems by designing an internal reward mechanism, specifically:
An internal reward is a reward the agent gives itself, as opposed to the external reward given by the environment. Internal rewards can encourage the agent to explore unknown environments, help it discover more behaviors, and avoid local optima to some extent, but they have limited effect on easing initial exploration of the environment and reducing training difficulty.
Accordingly, a technical solution is desired to overcome or at least alleviate at least one of the above-mentioned drawbacks of the prior art.
Disclosure of Invention
It is an object of the present invention to provide an aerial multi-agent training method that overcomes or at least alleviates at least one of the above-mentioned disadvantages of the prior art.
In one aspect of the present invention, there is provided an airborne multi-agent training method, including:
acquiring a multi-agent decision model, wherein the multi-agent decision model comprises action execution step length probability distribution information and a bias factor;
and training the multi-agent decision model, wherein, during training of the multi-agent decision model, the selection of an action execution step length from the action execution step length probability distribution information is influenced by the bias factor.
Optionally, the air multi-agent training method further comprises:
in training the multi-agent decision model, the bias factor value is changed according to a preset condition.
Optionally, the changing the value of the bias factor according to a preset condition includes:
obtaining the number of training rounds, and reducing the value of the bias factor when the number of training rounds is greater than a preset number; and/or,
obtaining an average reward value, and reducing the value of the bias factor when the average reward value is greater than a preset reward value.
Optionally, before the training of the multi-agent decision model, the aerial multi-agent training method further comprises:
obtaining a plurality of multi-agent training samples;
and training the aerial multi-agent decision model according to the multi-agent training samples.
Optionally, each of the multi-agent training samples comprises, for each step, current state information, executed action information, action execution step length information, the reward value information obtained from the environment, next state information following the current state, and information on whether the episode has ended.
Optionally, the multi-agent decision model comprises a multi-agent policy network and a value network.
The present application further provides an aerial multi-agent training device, comprising:
a model acquisition module for acquiring a multi-agent decision model comprising action execution step size probability distribution information and a bias factor;
a training module, configured to train the multi-agent decision model, where, during training of the multi-agent decision model, the multi-agent decision model is affected by the bias factor when selecting an action execution step length in the action execution step length probability distribution information.
The present application also provides an electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the aerial multi-agent training method described above when executing the computer program.
The present application also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the aerial multi-agent training method described above.
Advantageous effects
The air multi-agent training method has the following advantages:
1. by adding the bias factor, the agent can autonomously decide how long each action is executed, which exponentially reduces the difficulty of exploring the environment and obtaining reward values, so that the agent learns a macroscopic strategy more easily;
2. after the agent has accumulated a certain level of skill, the bias factor can be changed so that the agent gradually uses shorter decision intervals, explores more precise control methods, and obtains higher reward values;
3. the probability distribution over the action durations the agent selects and maintains can be intervened on externally, by changing the bias factor, while the gradient of the agent's decision network remains continuous; the agent is thus biased toward longer decision durations when exploring the environment, without affecting training of the network by gradient descent;
4. the algorithm extends to multi-agent deep reinforcement learning: different individuals among the agents are allowed to make decisions with different time step lengths, and the training method remains applicable in this scenario, i.e., asynchronous decisions of the individual agents are compatible, which reduces the restrictions on applying the algorithm in the multi-agent field;
5. the bias factor can be adjusted manually and a corresponding auxiliary reward mechanism designed: the agent is encouraged to select longer decision-holding times with higher probability in the early stage and to perform fine manipulations after a period of training, realizing a training and strategy-iteration process that proceeds from easy to difficult and from macroscopic to precise.
Drawings
Fig. 1 is a flowchart illustrating an airborne multi-agent training method according to an embodiment of the present application.
FIG. 2 is an exemplary block diagram of an electronic device capable of implementing functionality provided in accordance with one embodiment of the present application.
Detailed Description
In order to make the implementation objects, technical solutions and advantages of the present application clearer, the technical solutions in the embodiments of the present application will be described in more detail below with reference to the drawings in the embodiments of the present application. In the drawings, the same or similar reference numerals denote the same or similar elements or elements having the same or similar functions throughout. The described embodiments are a subset of the embodiments in the present application and not all embodiments in the present application. The embodiments described below with reference to the drawings are exemplary and intended to be used for explaining the present application and should not be construed as limiting the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application. Embodiments of the present application will be described in detail below with reference to the accompanying drawings.
It should be noted that the terms "first" and "second" in the description of the present invention are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
Fig. 1 is a flowchart illustrating an airborne multi-agent training method according to an embodiment of the present application.
The aerial multi-agent training method shown in Fig. 1 comprises the following steps:
step 1: acquiring a multi-agent decision model, wherein the multi-agent decision model comprises action execution step length probability distribution information and a bias factor;
step 2: training the multi-agent decision model, wherein, during training of the multi-agent decision model, the selection of an action execution step length from the action execution step length probability distribution information is influenced by the bias factor.
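Read end to end, steps 1 and 2 amount to a training loop of roughly the following shape. This is a minimal Python sketch under assumptions of this description only; the environment interface, the model methods and the numeric thresholds are illustrative placeholders, not taken from the publication.

def train_air_multi_agent(env, model, num_rounds, bias_factor, bias_decay_round=5000):
    """High-level sketch: collect experience, then update the decision model.

    `model` is assumed to bundle the multi-agent policy network, the value
    network, and a sampler over action execution step lengths that is
    influenced by `bias_factor`.
    """
    for round_idx in range(num_rounds):
        samples = []
        state, done = env.reset(), False
        while not done:
            # The current bias factor influences which execution step length is chosen.
            action, step_length = model.act(state, bias_factor)
            next_state, reward, done = env.step(action, step_length)
            samples.append((state, action, step_length, reward, next_state, done))
            state = next_state
        model.update(samples)  # e.g. an Actor-Critic update, as described later
        if round_idx > bias_decay_round:  # preset condition: enough training rounds completed
            bias_factor *= 0.95           # gradually shift toward shorter step lengths
    return model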
The air multi-agent training method thus provides the advantages already enumerated in the summary above.
In this embodiment, the aerial multi-agent training method further comprises: in training the multi-agent decision model, changing the bias factor value according to a preset condition.
In this embodiment, the changing the value of the bias factor according to the preset condition includes:
obtaining the number of training rounds, and reducing the value of the bias factor when the number of training rounds is greater than a preset number; and/or,
obtaining an average reward value, and reducing the value of the bias factor when the average reward value is greater than a preset reward value.
In this embodiment, during the decision process, a bias is added to the probability distribution over the action execution step lengths taken by the agent, so that the agent selects large time step lengths as much as possible early in training; this reduces the difficulty of obtaining rewards and lets the agent explore the environment and acquire reward values more quickly.
The action selection bias is adjusted dynamically as the number of training rounds increases and the average reward value grows: the bias is gradually reduced so that the agent can explore the environment at shorter time steps and make more agile decisions.
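A minimal sketch of such a schedule is shown below; the threshold values, decay rate and function name are assumptions made for illustration and do not come from the publication.

def update_bias_factor(bias, num_rounds, avg_reward,
                       round_threshold=5000, reward_threshold=0.0,
                       decay=0.95, min_bias=0.0):
    """Reduce the step-length bias once training has progressed far enough.

    The bias is lowered when the number of completed training rounds exceeds
    a preset count, or when the average reward exceeds a preset value, so the
    agent gradually shifts from long, coarse actions to shorter, more precise
    ones.
    """
    if num_rounds > round_threshold or avg_reward > reward_threshold:
        bias = max(min_bias, bias * decay)
    return bias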
In this embodiment, before training the multi-agent decision model, the aerial multi-agent training method further includes:
obtaining a plurality of multi-agent training samples;
and training the aerial multi-agent decision model according to the multi-agent training samples.
In this embodiment, each multi-agent training sample includes the following elements: the state S, the executed action a, the action execution step length t_m, the reward value r_i obtained from the environment at each step, the next state S', and a done flag indicating whether the episode has ended.
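Such a sample could be stored, for example, as a simple record; the field names below are illustrative assumptions.

from dataclasses import dataclass
from typing import Any

@dataclass
class Transition:
    state: Any        # current state S observed by the agents
    action: Any       # executed action a
    step_length: int  # action execution step length t_m (number of steps the action is held)
    reward: float     # reward value r_i returned by the environment for this step
    next_state: Any   # next state S' reached after the action has been held
    done: bool        # whether the episode has ended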
In this embodiment, the multi-agent decision model includes a multi-agent policy network and a value network. The policy network takes the current state as input and outputs the action to take and the action execution time, where the action execution time is obtained from the action execution step length probability distribution information and the bias factor. The value network takes the current state as input and outputs the value corresponding to that state.
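A minimal PyTorch sketch of two such networks is given below, assuming discrete maneuvers and a discrete set of candidate step lengths; the layer sizes and class names are illustrative assumptions.

import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Maps the current state to an action distribution and a step-length distribution."""
    def __init__(self, obs_dim, n_actions, n_durations, hidden=128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.action_head = nn.Linear(hidden, n_actions)      # logits over maneuvers a
        self.duration_head = nn.Linear(hidden, n_durations)  # logits over step lengths t_m

    def forward(self, obs):
        h = self.body(obs)
        return self.action_head(h), self.duration_head(h)

class ValueNet(nn.Module):
    """Maps the current state to a scalar state value."""
    def __init__(self, obs_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, obs):
        return self.net(obs)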
In this embodiment, training the multi-agent decision model specifically includes: training starts once enough samples have been collected to meet the required sample size. When a decision network under the Actor-Critic framework is adopted, the value network V_net and the policy network P_net are updated as follows:
[Formula images in the original publication: the update rules for the value network V_net and the policy network P_net.]
In the formulas, S_i is the state given by the environment, r_i is the reward value given by the environment at the i-th step, and γ is the discount factor; V_net(S_i) denotes the value output by the value network in state S_i, and p(a_i|S_i) and p(t_m|S_i) are, respectively, the probabilities that, in state S_i, the agent selects the current maneuver a_i and the maneuver duration t_m.
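Because the update formulas themselves are given only as images in the original publication, the sketch below shows one standard one-step advantage Actor-Critic update that is consistent with the symbol definitions above; it is an assumption for illustration, not the patent's exact formula.

import torch
import torch.nn.functional as F

def actor_critic_update(batch, p_net, v_net, p_opt, v_opt, gamma=0.99):
    """One-step advantage Actor-Critic update over a batch of transitions.

    `batch` is assumed to hold tensors: states S_i, long-typed action indices a_i
    and duration indices t_m, rewards r_i, next states S', and float done flags.
    """
    values = v_net(batch.states).squeeze(-1)                    # V_net(S_i)
    with torch.no_grad():
        next_values = v_net(batch.next_states).squeeze(-1)      # V_net(S')
        targets = batch.rewards + gamma * next_values * (1.0 - batch.dones)
    advantage = targets - values.detach()

    # Value network: regress V_net(S_i) toward the bootstrapped target.
    value_loss = F.mse_loss(values, targets)
    v_opt.zero_grad(); value_loss.backward(); v_opt.step()

    # Policy network: raise the joint log-probability of the chosen maneuver a_i
    # and its duration t_m in proportion to the advantage.
    action_logits, duration_logits = p_net(batch.states)
    log_p_a = F.log_softmax(action_logits, dim=-1).gather(1, batch.actions.unsqueeze(1)).squeeze(1)
    log_p_t = F.log_softmax(duration_logits, dim=-1).gather(1, batch.durations.unsqueeze(1)).squeeze(1)
    policy_loss = -((log_p_a + log_p_t) * advantage).mean()
    p_opt.zero_grad(); policy_loss.backward(); p_opt.step()
    return value_loss.item(), policy_loss.item()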
In this embodiment, the influence of the bias factor on the multi-agent decision model's selection of the action execution step length from the action execution step length probability distribution information is specifically as follows:
The bias factor is a probability factor. By setting its value, the multi-agent decision model makes a biased selection when choosing an action execution step length from the action execution step length probability distribution information: when the bias factor value is higher, the model more often selects large time step lengths, and when the bias factor value is lower, the model more often selects small time step lengths.
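One way to realize this biased selection, while leaving the gradient through the policy network intact, is to add a bias offset to the step-length logits before sampling; the sketch below is an assumed mechanism for illustration, since the publication does not spell out the exact form.

import torch

def sample_step_length(duration_logits, step_lengths, bias_factor):
    """Sample an action execution step length under an external bias.

    duration_logits : raw logits from the policy network, shape (n_durations,)
    step_lengths    : tensor of candidate step lengths, e.g. torch.tensor([1, 2, 4, 8])
    bias_factor     : larger values favor longer step lengths; 0 leaves the policy untouched

    The bias enters as an additive offset proportional to the normalized step
    length, so gradients still flow through `duration_logits` unchanged.
    """
    offset = bias_factor * step_lengths.float() / step_lengths.max()
    probs = torch.softmax(duration_logits + offset, dim=-1)
    idx = torch.multinomial(probs, 1).item()
    return step_lengths[idx].item(), probs[idx]

With a high bias_factor the probability mass concentrates on the larger entries of step_lengths, matching the behavior described above; as the bias decays toward zero, the selection reverts to the unbiased policy distribution.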
The application also provides an aerial multi-agent training device, which comprises a model acquisition module and a training module. The model acquisition module is used for acquiring a multi-agent decision model, wherein the multi-agent decision model comprises action execution step length probability distribution information and a bias factor; the training module is used for training the multi-agent decision model, wherein, during training of the multi-agent decision model, the selection of an action execution step length from the action execution step length probability distribution information is influenced by the bias factor.
It should be noted that the foregoing explanations of the method embodiments also apply to the apparatus of this embodiment, and are not repeated herein.
The present application also provides an electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the above aerial multi-agent training method when executing the computer program.
The present application also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the above aerial multi-agent training method.
FIG. 2 is an exemplary block diagram of an electronic device capable of implementing functionality provided in accordance with one embodiment of the present application.
As shown in fig. 2, the electronic device includes an input device 501, an input interface 502, a central processor 503, a memory 504, an output interface 505, and an output device 506. The input interface 502, the central processor 503, the memory 504 and the output interface 505 are connected to each other through a bus 507, and the input device 501 and the output device 506 are connected to the bus 507 through the input interface 502 and the output interface 505, respectively, and further connected to other components of the electronic device. Specifically, the input device 501 receives input information from the outside and transmits the input information to the central processor 503 through the input interface 502; the central processor 503 processes the input information based on computer-executable instructions stored in the memory 504 to generate output information, temporarily or permanently stores the output information in the memory 504, and then transmits the output information to the output device 506 through the output interface 505; the output device 506 outputs the output information to the outside of the electronic device for use by the user.
That is, the electronic device shown in fig. 2 may also be implemented to include: a memory storing computer-executable instructions; and one or more processors which, when executing the computer executable instructions, may implement the over-the-air multi-agent training method described in connection with fig. 1.
In one embodiment, the electronic device shown in fig. 2 may be implemented to include: a memory 504 configured to store executable program code; one or more processors 503 configured to execute executable program code stored in memory 504 to perform the over-the-air multi-agent training method of the above embodiments.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media that implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Furthermore, it will be obvious that the term "comprising" does not exclude other elements or steps. A plurality of units, modules or devices recited in the device claims may also be implemented by one unit or overall device by software or hardware. The terms first, second, etc. are used to identify names, but not any particular order.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks identified in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The processor in this embodiment may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory may be used to store computer programs and/or modules, and the processor implements various functions of the apparatus/terminal device by running or executing the computer programs and/or modules stored in the memory and by invoking data stored in the memory. The memory may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system and application programs required by at least one function (such as a sound playing function, an image playing function, etc.), and the data storage area may store data created according to the use of the device (such as audio data, a phonebook, etc.). In addition, the memory may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash memory card (Flash Card), at least one magnetic disk storage device, a flash memory device, or other non-volatile solid-state storage device.
In this embodiment, the modules/units integrated in the apparatus/terminal device may be stored in a computer-readable storage medium if they are implemented in the form of software functional units and sold or used as independent products. Based on this understanding, all or part of the flow in the methods of the above embodiments may also be implemented by a computer program, which may be stored in a computer-readable storage medium and which, when executed by a processor, implements the steps of the above method embodiments. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like.
It should be noted that the computer readable medium may contain content that is appropriately increased or decreased as required by legislation and patent practice in the jurisdiction. Although the present application has been described with reference to the preferred embodiments, it is not intended to limit the present application, and those skilled in the art can make variations and modifications without departing from the spirit and scope of the present application.
Although the invention has been described in detail hereinabove with respect to a general description and specific embodiments thereof, it will be apparent to those skilled in the art that modifications or improvements may be made thereto based on the invention. Accordingly, such modifications and improvements are intended to be within the scope of the invention as claimed.

Claims (9)

1. An aerial multi-agent training method, characterized in that the aerial multi-agent training method comprises:
acquiring a multi-agent decision model, wherein the multi-agent decision model comprises action execution step length probability distribution information and a bias factor;
and training the multi-agent decision model, wherein, during training of the multi-agent decision model, the selection of an action execution step length from the action execution step length probability distribution information is influenced by the bias factor.
2. The aerial multi-agent training method of claim 1, wherein the aerial multi-agent training method further comprises:
in training the multi-agent decision model, the bias factor value is changed according to a preset condition.
3. The aerial multi-agent training method of claim 2, wherein changing the bias factor value according to a preset condition comprises:
obtaining the number of training rounds, and reducing the value of the bias factor when the number of training rounds is greater than a preset number; and/or,
obtaining an average reward value, and reducing the value of the bias factor when the average reward value is greater than a preset reward value.
4. The aerial multi-agent training method of claim 3, wherein, before the training of the multi-agent decision model, the aerial multi-agent training method further comprises:
obtaining a plurality of multi-agent training samples;
and training the aerial multi-agent decision model according to the multi-agent training samples.
5. The aerial multi-agent training method as recited in claim 4, wherein each of said multi-agent training samples comprises, for each step, current state information, executed action information, action execution step length information, the reward value information obtained from the environment, next state information following the current state, and information on whether the episode has ended.
6. The aerial multi-agent training method of claim 5, wherein the multi-agent decision model comprises a multi-agent policy network and a value network.
7. An aerial multi-agent training device, characterized in that it comprises:
a model acquisition module for acquiring a multi-agent decision model comprising action execution step size probability distribution information and a bias factor;
a training module, configured to train the multi-agent decision model, where, during training of the multi-agent decision model, the multi-agent decision model is affected by the bias factor when selecting an action execution step length in the action execution step length probability distribution information.
8. An electronic device, characterized in that the electronic device comprises a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the aerial multi-agent training method as claimed in any one of claims 1 to 6 when executing the computer program.
9. A computer-readable storage medium, characterized in that it stores a computer program which, when executed by a processor, implements the aerial multi-agent training method as claimed in any one of claims 1 to 6.
CN202110904682.9A 2021-08-07 2021-08-07 Air multi-agent training method and device Pending CN113705778A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110904682.9A CN113705778A (en) 2021-08-07 2021-08-07 Air multi-agent training method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110904682.9A CN113705778A (en) 2021-08-07 2021-08-07 Air multi-agent training method and device

Publications (1)

Publication Number Publication Date
CN113705778A 2021-11-26

Family

ID=78651812

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110904682.9A Pending CN113705778A (en) 2021-08-07 2021-08-07 Air multi-agent training method and device

Country Status (1)

Country Link
CN (1) CN113705778A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination