CN115797394A - Multi-agent covering method based on reinforcement learning

Multi-agent covering method based on reinforcement learning

Info

Publication number
CN115797394A
CN115797394A (application CN202211432494.1A)
Authority
CN
China
Prior art keywords
agent
mobile
coverage
area
agents
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211432494.1A
Other languages
Chinese (zh)
Other versions
CN115797394B (en)
Inventor
孙新苗
任明里
丁大伟
任莹莹
王恒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology Beijing USTB
Original Assignee
University of Science and Technology Beijing USTB
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology Beijing USTB filed Critical University of Science and Technology Beijing USTB
Priority to CN202211432494.1A priority Critical patent/CN115797394B/en
Publication of CN115797394A publication Critical patent/CN115797394A/en
Application granted granted Critical
Publication of CN115797394B publication Critical patent/CN115797394B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 30/00: Reducing energy consumption in communication networks
    • Y02D 30/70: Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Feedback Control In General (AREA)

Abstract

The invention discloses a multi-agent coverage method based on reinforcement learning, which comprises the following steps: determining the positions of a plurality of stationary agents in an area with the goal of maximizing coverage performance, and dividing the area into an effective coverage area and an ineffective coverage area according to the positions of the stationary agents; calculating the maximum coverage performance obtainable by the mobile agents; setting the observations and actions of the mobile agents, and setting the reward of the mobile agents based on the maximum coverage performance obtainable by the mobile agents; each mobile agent aims to maximize its own reward function, and, based on a reinforcement learning algorithm, the mobile agents interact with the environment simultaneously for distributed training, which yields the motion plan of each mobile agent and thereby covers the areas that are not yet effectively covered. The technical scheme of the invention enables multiple agents to cooperatively achieve effective coverage of the area and improves the coverage performance of the area.

Description

Multi-agent covering method based on reinforcement learning
Technical Field
The invention relates to the technical field of multi-agent system coverage optimization, in particular to a multi-agent coverage method based on reinforcement learning.
Background
With the rapid development of computers, micro-electro-mechanical systems, robotics and communication technologies, multi-agent systems are attracting increasing attention and are being applied to many fields, including area coverage. Multi-agent area coverage means that multiple agents form a team and effectively cover the whole area through a cooperation strategy. Performing the area coverage task cooperatively allows the target task to be completed more efficiently, overcomes the limitation on the number and angle of sensors of a single agent, and gives the coverage system redundancy. At present, although existing schemes can achieve full coverage of an area with multiple agents, they cannot improve coverage performance while achieving effective coverage.
Disclosure of Invention
The invention provides a multi-agent coverage method based on reinforcement learning, which quickly achieves effective coverage of an area and improves the area coverage performance.
In order to solve the technical problems, the invention provides the following technical scheme:
in one aspect, the present invention provides a reinforcement learning-based multi-agent coverage method, the multi-agent comprising a plurality of stationary agents and a plurality of mobile agents, the multi-agent coverage method comprising:
determining the positions of the plurality of stationary agents in an area with the goal of maximizing coverage performance, and dividing the area into an effective coverage area and an ineffective coverage area according to the positions of the stationary agents;
calculating the maximum coverage performance obtainable by the mobile agents;
setting the observation and action of each mobile agent on the environment, and setting the reward of the mobile agents based on the maximum coverage performance obtainable by the mobile agents; each mobile agent aims to maximize its own reward function, and, based on a reinforcement learning algorithm, the mobile agents interact with the environment simultaneously for distributed training, which yields the motion plan of each mobile agent and thereby covers the areas that are not yet effectively covered.
Further, the determining the locations of the plurality of stationary agents in the area with the goal of maximizing coverage performance includes:
the position of a plurality of stationary agents in an area is adjusted such that the coverage performance is as large as possible.
Further, the coverage performance H(S) is calculated as follows:
H(S)=∫R(x)P(x,S)dx
wherein P (x, S) is the joint detection probability of the multi-agent at point x in the area,
P(x, S) = 1 - ∏_{i=1}^{N} (1 - p_i(x, s_i))
where p_i(x, s_i) is the detection probability of the i-th agent, N is the number of agents, and R(x) is the event density function.
Further, when the area is divided into an effective coverage area and an ineffective coverage area, whether a point x in the area is effectively covered is judged according to whether the joint detection probability P(x, S) of the multi-agent at x is larger than a preset threshold; when P(x, S) is larger than the preset threshold, x is effectively covered, otherwise x is not effectively covered.
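As an illustration (not part of the claimed method), the coverage performance and the effective-coverage test above can be evaluated numerically on a discretized grid. A minimal Python sketch follows; the Gaussian form of p_i(x, s_i) and the threshold ρ = 0.9 are assumptions made only for this example.

```python
import numpy as np

def detection_prob(points, agent_pos, sigma=3.0):
    """p_i(x, s_i): assumed Gaussian, monotonically decreasing in ||x - s_i||."""
    d2 = np.sum((points - agent_pos) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def joint_detection_prob(points, agent_positions, sigma=3.0):
    """P(x, S) = 1 - prod_i (1 - p_i(x, s_i))."""
    miss = np.ones(points.shape[0])
    for s in agent_positions:
        miss *= 1.0 - detection_prob(points, s, sigma)
    return 1.0 - miss

def coverage_performance(points, density, agent_positions, cell_area=1.0):
    """H(S) = integral of R(x) * P(x, S) dx, approximated by a grid sum."""
    return float(np.sum(density * joint_detection_prob(points, agent_positions)) * cell_area)

def effective_mask(points, agent_positions, rho=0.9):
    """Cells with P(x, S) > rho count as effectively covered."""
    return joint_detection_prob(points, agent_positions) > rho

# Example: 20 x 20 grid, uniform event density, two stationary agents.
xs, ys = np.meshgrid(np.arange(20), np.arange(20))
grid = np.stack([xs.ravel(), ys.ravel()], axis=-1).astype(float)
R = np.ones(grid.shape[0])
S = np.array([[5.0, 5.0], [14.0, 14.0]])
print(coverage_performance(grid, R, S), effective_mask(grid, S).mean())
```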
Further, the observation of the mobile agent on the environment consists of three binary images, wherein:
the first binary image represents an area which is not effectively covered currently;
the second binary image represents the position of the current mobile agent;
the third binary image represents the locations of other mobile agents in addition to the current mobile agent.
Further, the action set of the mobile agent is {0,1,2,3,4}, which respectively represents that the mobile agent is stationary, the mobile agent moves upwards, the mobile agent moves downwards, the mobile agent moves leftwards and the mobile agent moves rightwards.
Further, the Reward of the environment to the mobile agent is:
Reward = (H_current - H_max)/10 + incres*30
where H_current is the coverage performance of the mobile agent at the current position, H_max is the maximum coverage performance obtainable by the mobile agent, and incres is the area of effective coverage newly added compared with the previous moment. The first part of the reward represents the gap between the coverage performance of the mobile agent at the current position and the maximum value, and the second part is the newly added effectively covered area compared with the previous moment.
Furthermore, when a plurality of mobile agents interact with the environment simultaneously for distributed training based on the reinforcement learning algorithm, the actor network and the critic network of each mobile agent are set to two convolutional layers followed by three fully connected layers; the first convolutional layer has 16 convolution kernels of size 20 × 20, the second convolutional layer has 8 convolution kernels of size 10 × 10, and the three fully connected layers have 256, 128 and 64 channels, respectively.
In yet another aspect, the present invention also provides an electronic device comprising a processor and a memory; wherein the memory has stored therein at least one instruction that is loaded and executed by the processor to implement the above-described method.
In yet another aspect, the present invention also provides a computer-readable storage medium having at least one instruction stored therein, which is loaded and executed by a processor to implement the above-mentioned method.
The technical scheme provided by the invention has the beneficial effects that at least:
1. The invention enables multiple agents to cooperatively achieve effective coverage of the area.
2. The invention utilizes the decision optimization capability of reinforcement learning, and can improve the coverage performance of the area while realizing effective coverage. The invention has the advantages of high efficiency and strong robustness.
Drawings
In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart of a reinforcement learning based multi-agent overlay method provided by an embodiment of the present invention;
FIG. 2 is a schematic diagram of a static agent location deployment provided by an embodiment of the invention;
FIG. 3 is a schematic diagram of mobile agent and environment interaction provided by an embodiment of the present invention;
FIG. 4 is a schematic diagram of the effective coverage ratio as a function of the number of steps, provided by an embodiment of the present invention;
FIG. 5 is a schematic diagram of the coverage performance as a function of the number of steps, provided by an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in detail with reference to the accompanying drawings.
First embodiment
The embodiment provides a multi-agent covering method based on reinforcement learning. First, it should be noted that the multi-agent in this embodiment includes two types of agents, i.e., a stationary agent and a mobile agent, and by controlling the movement of the mobile agent, the effective coverage of the area is realized and the coverage performance of the area is improved.
Based on the above, the execution flow of the method of the embodiment is shown in fig. 1, and includes the following steps:
s1, with the aim of maximizing coverage performance, determining the positions of a plurality of static intelligent agents in an area, and dividing the area into an effective coverage area and an ineffective coverage area according to the positions of the static intelligent agents;
therein, it is required toIt is noted that in determining the location of a stationary agent, the goal should be to maximize coverage performance, i.e., adjust the location of the multi-agent S = (S) 1 ,…,s N ) Making the coverage performance function H (S) as large as possible, wherein the coverage performance of the multi-agent in the area is the integral of the product of the event density and the detection probability in the area, namely: h (S) = R (x) P (x, S) dx, wherein P (x, S) is the joint detection probability of the multi-agent system S at point x,
Figure BDA0003945116060000041
p i (x,s i ) The detection probability for the ith agent, typically x and s i The distance between the two intelligent agents is a monotone decreasing function, N is the number of the intelligent agents, and R (x) is an event density function.
In particular, in this embodiment the location deployment of the stationary agents is as shown in FIG. 2, where the grey area is the area where effective coverage has already been achieved. Whether a point x in the area is effectively covered is judged according to whether the joint detection probability P(x, S) of the multi-agent system at x is greater than a threshold ρ: when P(x, S) > ρ, x is effectively covered; otherwise it is not. After the not-yet-covered areas are obtained, the goal of the mobile agents is to cover them, i.e., to reach P(x, S) ≥ ρ at some moment, while improving the coverage performance H(S) as much as possible during the motion.
S2, calculating the maximum coverage performance obtainable by the mobile agents;
It should be noted that, in this embodiment, the maximum coverage performance H_max obtainable by the mobile agents is calculated, i.e., the maximum of the coverage performance H(S) = ∫R(x)P(x,S)dx. This value is used in the calculation of the mobile agents' reward function in the subsequent steps. When the number of mobile agents is small, H_max can generally be computed by a greedy algorithm: on the basis of the stationary agents, one mobile agent is added at a time at the position that increases the coverage performance the most.
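A hedged sketch of this greedy computation is given below. It reuses coverage_performance and the grid variables from the earlier sketch; treating every grid cell as a candidate position is an assumption, since the text does not fix the candidate set.

```python
import numpy as np

def greedy_h_max(points, density, static_positions, n_mobile, cell_area=1.0):
    """Greedily place n_mobile agents, each at the cell giving the largest gain in H(S)."""
    positions = [np.asarray(p, dtype=float) for p in static_positions]
    for _ in range(n_mobile):
        base = coverage_performance(points, density, positions, cell_area)
        best_gain, best_pos = -np.inf, None
        for cand in points:                          # every grid cell is a candidate
            gain = coverage_performance(points, density, positions + [cand], cell_area) - base
            if gain > best_gain:
                best_gain, best_pos = gain, cand
        positions.append(best_pos)
    return coverage_performance(points, density, positions, cell_area), positions

# H_max for two mobile agents added on top of the stationary agents S above.
H_max, placed = greedy_h_max(grid, R, S, n_mobile=2)
```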
S3, setting the observation and action of each mobile agent on the environment, and setting the reward of the mobile agents based on the maximum coverage performance obtainable by the mobile agents; each mobile agent aims to maximize its own reward function, and, based on a reinforcement learning algorithm, the mobile agents interact with the environment simultaneously for distributed training, which yields the motion plan of each mobile agent and thereby covers the areas that are not yet effectively covered.
It should be noted that the above steps prepare the reinforcement-learning training of the mobile agents. FIG. 3 illustrates the interaction between three exemplary mobile agents and the environment. Before training, the action set of the agents, the agents' observation of the environment, and the reward given to the agents by the environment need to be set. The environment is a grid environment containing the stationary agents, in which a mobile agent can choose among 5 actions: stay, move up, move down, move left and move right. Accordingly, the action set of a mobile agent is set to {0, 1, 2, 3, 4}, representing stay, up, down, left and right respectively, with a movement distance of one cell. To achieve effective coverage of the area cooperatively, the agent's observation of the environment is set to three binary images: the first binary image encodes which area is not yet effectively covered, with effectively covered cells marked 1 and not-yet-covered cells marked 0; the second binary image marks the position of the current mobile agent with 1; the third binary image marks the cells occupied by the other mobile agents with 1. The reward given by the environment to the agent consists of two parts, which respectively reflect the goals of achieving effective coverage and improving the coverage performance. The Reward given by the environment to the mobile agent is:
Reward = (H_current - H_max)/10 + incres*30
where H_current is the coverage performance of the mobile agent at the current position, and incres is the area of effective coverage newly added compared with the previous moment. The first part of the reward represents the gap between the coverage performance of the mobile agent at the current position and the maximum value H_max, and the second part is the newly added effectively covered area compared with the previous moment. This reward is designed to achieve effective coverage quickly and to improve the coverage performance of the area.
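A minimal sketch of the observation encoding and reward is shown below. The marking convention and the reward constants follow the text; the channel layout (3 × height × width), the row/column orientation and the action-to-offset mapping are assumptions for the example.

```python
import numpy as np

def build_observation(covered_mask, own_pos, other_positions):
    """Three binary images: coverage status, own position, other mobile agents."""
    h, w = covered_mask.shape
    obs = np.zeros((3, h, w), dtype=np.float32)
    obs[0] = covered_mask.astype(np.float32)      # 1 = effectively covered, 0 = not yet covered
    obs[1, own_pos[0], own_pos[1]] = 1.0          # current mobile agent
    for r, c in other_positions:
        obs[2, r, c] = 1.0                        # other mobile agents
    return obs

# Action set {0,1,2,3,4}: stay, up, down, left, right (row index assumed to grow downward).
ACTION_OFFSETS = {0: (0, 0), 1: (-1, 0), 2: (1, 0), 3: (0, -1), 4: (0, 1)}

def reward(h_current, h_max, incres):
    """Reward = (H_current - H_max)/10 + incres * 30."""
    return (h_current - h_max) / 10.0 + incres * 30.0
```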
Further, when a plurality of mobile agents interact with the environment simultaneously for distributed training based on the reinforcement learning algorithm, the actor network and the critic network of each mobile agent are set to two convolutional layers followed by three fully connected layers; the first convolutional layer has 16 convolution kernels of size 20 × 20, the second convolutional layer has 8 convolution kernels of size 10 × 10, and the three fully connected layers have 256, 128 and 64 channels, respectively.
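An illustrative PyTorch sketch of that network shape follows. The padding, activation functions, input grid size and the output heads (5 action logits for the actor, a scalar value for the critic) are assumptions not fixed by the text.

```python
import torch
import torch.nn as nn

class CoverageNet(nn.Module):
    """Two conv layers (16 kernels of 20x20, then 8 of 10x10) + FC layers of 256/128/64 units."""
    def __init__(self, out_dim):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=20, padding=10), nn.ReLU(),
            nn.Conv2d(16, 8, kernel_size=10, padding=5), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(256), nn.ReLU(),     # flattened size depends on the grid, hence lazy
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, out_dim),            # assumed output head
        )

    def forward(self, x):
        return self.body(x)

actor = CoverageNet(out_dim=5)     # logits over {stay, up, down, left, right}
critic = CoverageNet(out_dim=1)    # state value V(s)
obs = torch.zeros(1, 3, 40, 40)    # assumed 40 x 40 grid for the demo
logits, value = actor(obs), critic(obs)
```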
Further, when a plurality of mobile agents are trained simultaneously, this embodiment trains with the proximal policy optimization (PPO) algorithm, a model-free, on-policy policy-gradient reinforcement learning method. The specific procedure is as follows:
a. The actor π(A|S; θ) is initialized with random parameters θ, and the critic V(S; φ) is initialized with random parameters φ.
b. Generate N steps of experience following the current policy; the experience sequence is:
S_{ts}, A_{ts}, R_{ts+1}, …, S_{ts+N-1}, A_{ts+N-1}, R_{ts+N}, S_{ts+N}
where A_t is the action taken in state S_t, S_{t+1} is the next state, and R_{t+1} is the reward for the transition from S_t to S_{t+1}. In state S_t, the agent computes the probability of each action with π(A|S; θ) and randomly samples the action A_t from that probability distribution.
c. For each step t = ts+1, ts+2, …, ts+N, compute the return value G_t and the advantage function D_t:
D_t = Σ_{k=t}^{ts+N-1} (γλ)^{k-t} δ_k,  δ_k = R_{k+1} + bγV(S_{k+1}; φ) - V(S_k; φ),  G_t = D_t + V(S_t; φ)
where b is 0 when S_{ts+N} is a terminal state and 1 otherwise, λ is a smoothing coefficient, and γ is a discount coefficient.
d. Randomly draw a mini-batch of size M from the current experience set and learn from it: the critic parameters φ are updated by minimizing the loss function
L_critic(φ) = (1/M) Σ_{i=1}^{M} (G_i - V(S_i; φ))^2
and the actor parameters θ are updated by minimizing the clipped actor loss function
L_actor(θ) = -(1/M) Σ_{i=1}^{M} min(r_i(θ) D_i, c_i(θ) D_i)
where r_i(θ) = π(A_i | S_i; θ) / π(A_i | S_i; θ_old) and c_i(θ) = max(min(r_i(θ), 1+ε), 1-ε). To facilitate the exploration of the agents, an entropy loss term
E_i(θ) = -w Σ_a π(a | S_i; θ) ln π(a | S_i; θ)
with entropy weight w is added to the actor loss.
e. Repeating b to d until the training termination condition is reached.
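A hedged PyTorch sketch of steps c and d (advantage estimation over one N-step segment and the clipped PPO losses) is given below; the entropy weight, the handling of the bootstrap value and the minibatch mechanics are assumptions.

```python
import torch

def advantages_and_returns(rewards, values, bootstrap_value, terminal, gamma=0.99, lam=0.95):
    """D_t (GAE) and G_t = D_t + V(S_t) for one segment; b = 0 if the segment ends in a terminal state."""
    b = 0.0 if terminal else 1.0
    n = len(rewards)
    adv = torch.zeros(n)
    next_value, running = b * bootstrap_value, 0.0
    for t in reversed(range(n)):
        delta = rewards[t] + gamma * next_value - values[t]   # delta_k
        running = delta + gamma * lam * running
        adv[t] = running
        next_value = values[t]
    return adv, adv + values                                   # D_t, G_t

def ppo_losses(new_logp, old_logp, adv, returns, value_pred, entropy, eps=0.2, w=0.01):
    """Clipped actor loss with entropy bonus, and squared-error critic loss."""
    ratio = torch.exp(new_logp - old_logp)                     # r_i(theta)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)         # c_i(theta)
    actor_loss = -torch.min(ratio * adv, clipped * adv).mean() - w * entropy.mean()
    critic_loss = ((returns - value_pred) ** 2).mean()
    return actor_loss, critic_loss
```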
After training is completed by executing the above steps, the change of the effective coverage ratio with the number of steps is shown in FIG. 4, from which it can be seen that the coverage rate of this embodiment reaches 97%. The change of the coverage performance with the number of steps is shown in FIG. 5, which shows that the coverage performance is improved both while effective coverage is being achieved and after it has been achieved.
In summary, this embodiment provides a multi-agent coverage method based on reinforcement learning that enables multiple agents to cooperatively achieve effective coverage of an area. By exploiting the decision-optimization capability of reinforcement learning, the method improves the coverage performance of the area while achieving effective coverage. The method has the advantages of high efficiency and strong robustness.
Second embodiment
The present embodiment provides an electronic device, which includes a processor and a memory; wherein the memory has stored therein at least one instruction that is loaded and executed by the processor to implement the method of the first embodiment.
The electronic device may vary considerably in configuration and performance, and may include one or more processors (CPUs) and one or more memories, where the memory stores at least one instruction that is loaded by the processor to execute the above method.
Third embodiment
The present embodiment provides a computer-readable storage medium, which stores at least one instruction, and the instruction is loaded and executed by a processor to implement the method of the first embodiment. The computer readable storage medium may be, among others, ROM, random access memory, CD-ROM, magnetic tape, floppy disk, optical data storage device, and the like. The instructions stored therein may be loaded by a processor in the terminal and perform the above-described method.
Furthermore, it should be noted that the present invention may be provided as a method, apparatus or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media having computer-usable program code embodied in the medium.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, an embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It should also be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of additional identical elements in the process, method, article, or terminal device that comprises that element.
Finally, it should be noted that while the above describes a preferred embodiment of the invention, it will be appreciated by those skilled in the art that, once having the benefit of the teaching of the present invention, numerous modifications and adaptations may be made without departing from the principles of the invention and are intended to be within the scope of the invention. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.

Claims (8)

1. A multi-agent coverage method based on reinforcement learning, wherein the multi-agent comprises a plurality of stationary agents and a plurality of mobile agents, the multi-agent coverage method comprising:
determining the positions of the plurality of stationary agents in an area with the goal of maximizing coverage performance, and dividing the area into an effective coverage area and an ineffective coverage area according to the positions of the stationary agents;
calculating the maximum coverage performance obtainable by the mobile agents;
setting the observation and action of each mobile agent on the environment, and setting the reward of the mobile agents based on the maximum coverage performance obtainable by the mobile agents; each mobile agent aims to maximize its own reward function, and, based on a reinforcement learning algorithm, the mobile agents interact with the environment simultaneously for distributed training, which yields the motion plan of each mobile agent and thereby covers the areas that are not yet effectively covered.
2. The reinforcement learning-based multi-agent coverage method of claim 1, wherein said determining locations of a plurality of stationary agents in an area with the goal of maximizing coverage performance comprises:
the position of a plurality of stationary agents in an area is adjusted such that the coverage performance is as large as possible.
3. The reinforcement learning-based multi-agent coverage method of claim 2, wherein the coverage performance H(S) is calculated as follows:
H(S)=∫R(x)P(x,S)dx
wherein P (x, S) is the joint detection probability of the multi-agent at point x in the region,
P(x, S) = 1 - ∏_{i=1}^{N} (1 - p_i(x, s_i))
where p_i(x, s_i) is the detection probability of the i-th agent, N is the number of agents, and R(x) is the event density function.
4. A reinforcement learning-based multi-agent coverage method as claimed in claim 3, wherein, when dividing the area into an effective coverage area and an ineffective coverage area, whether a point x in the area is effectively covered is judged according to whether the joint detection probability P(x, S) of the multi-agent at x is larger than a preset threshold; when P(x, S) is larger than the preset threshold, x is effectively covered, otherwise x is not effectively covered.
5. The reinforcement learning-based multi-agent coverage method of claim 1, wherein the mobile agent's observation of the environment consists of three binary images, wherein:
the first binary image represents an area which is not effectively covered currently;
the second binary image represents the position of the current mobile agent;
the third binary image represents the locations of other mobile agents in addition to the current mobile agent.
6. The reinforcement learning-based multi-agent coverage method of claim 5, wherein the action set of the mobile agent is {0,1,2,3,4}, representing the mobile agent being stationary, moving up, moving down, moving left and moving right, respectively.
7. The reinforcement learning-based multi-agent coverage method of claim 6, wherein the Reward given by the environment to the mobile agent is:
Reward = (H_current - H_max)/10 + incres*30
where H_current is the coverage performance of the mobile agent at the current position, H_max is the maximum coverage performance obtainable by the mobile agent, and incres is the effective coverage area newly added compared with the previous moment; the first part of the reward represents the gap between the coverage performance of the mobile agent at the current position and the maximum value, and the second part is the newly added effectively covered area compared with the previous moment.
8. A reinforcement learning-based multi-agent coverage method as claimed in any one of claims 1 to 7, wherein, when a plurality of mobile agents interact with the environment simultaneously for distributed training based on the reinforcement learning algorithm, the actor network and the critic network of each mobile agent are set to two convolutional layers followed by three fully connected layers; the first convolutional layer has 16 convolution kernels of size 20 × 20, the second convolutional layer has 8 convolution kernels of size 10 × 10, and the three fully connected layers have 256, 128 and 64 channels, respectively.
CN202211432494.1A 2022-11-15 2022-11-15 Multi-agent coverage method based on reinforcement learning Active CN115797394B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211432494.1A CN115797394B (en) 2022-11-15 2022-11-15 Multi-agent coverage method based on reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211432494.1A CN115797394B (en) 2022-11-15 2022-11-15 Multi-agent coverage method based on reinforcement learning

Publications (2)

Publication Number Publication Date
CN115797394A true CN115797394A (en) 2023-03-14
CN115797394B CN115797394B (en) 2023-09-05

Family

ID=85438088

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211432494.1A Active CN115797394B (en) 2022-11-15 2022-11-15 Multi-agent coverage method based on reinforcement learning

Country Status (1)

Country Link
CN (1) CN115797394B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113392935A (en) * 2021-07-09 2021-09-14 浙江工业大学 Multi-agent deep reinforcement learning strategy optimization method based on attention mechanism
WO2022083029A1 (en) * 2020-10-19 2022-04-28 深圳大学 Decision-making method based on deep reinforcement learning
CN114879742A (en) * 2022-06-17 2022-08-09 电子科技大学 Unmanned aerial vehicle cluster dynamic coverage method based on multi-agent deep reinforcement learning
CN115327926A (en) * 2022-09-15 2022-11-11 中国科学技术大学 Multi-agent dynamic coverage control method and system based on deep reinforcement learning

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022083029A1 (en) * 2020-10-19 2022-04-28 深圳大学 Decision-making method based on deep reinforcement learning
CN113392935A (en) * 2021-07-09 2021-09-14 浙江工业大学 Multi-agent deep reinforcement learning strategy optimization method based on attention mechanism
CN114879742A (en) * 2022-06-17 2022-08-09 电子科技大学 Unmanned aerial vehicle cluster dynamic coverage method based on multi-agent deep reinforcement learning
CN115327926A (en) * 2022-09-15 2022-11-11 中国科学技术大学 Multi-agent dynamic coverage control method and system based on deep reinforcement learning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
YUZE FENG et al.: "Rapid Coverage Control with Multi-agent Systems Based on K-Means Algorithm", 2020 7th International Conference on Information, Cybernetics, and Computational Social Systems (ICCSS), pages 870-873 *
双炜: "Routing algorithm for LEO satellite networks applying multi-agent link cognition" (in Chinese), Spacecraft Engineering (航天器工程), vol. 24, no. 4, pages 83-87 *
薛玉玺: "Research on multi-agent swarm area coverage algorithms based on deep reinforcement learning" (in Chinese), China Masters' Theses Full-text Database, Information Science and Technology, no. 01, pages 140-397 *

Also Published As

Publication number Publication date
CN115797394B (en) 2023-09-05

Similar Documents

Publication Publication Date Title
CN111563188B (en) Mobile multi-agent cooperative target searching method
Ross et al. Efficient reductions for imitation learning
CN110327624B (en) Game following method and system based on curriculum reinforcement learning
CN111260027B (en) Intelligent agent automatic decision-making method based on reinforcement learning
US20190286979A1 (en) Reinforcement Learning for Concurrent Actions
CN106250931A (en) A kind of high-definition picture scene classification method based on random convolutional neural networks
CN109064514A (en) A kind of six-freedom degree pose algorithm for estimating returned based on subpoint coordinate
CN111105034A (en) Multi-agent deep reinforcement learning method and system based on counter-fact return
CN112533237B (en) Network capacity optimization method for supporting large-scale equipment communication in industrial internet
US20220176554A1 (en) Method and device for controlling a robot
CN116051683B (en) Remote sensing image generation method, storage medium and device based on style self-organization
JP7448683B2 (en) Learning options for action selection using meta-gradient in multi-task reinforcement learning
CN113821041A (en) Multi-robot collaborative navigation and obstacle avoidance method
CN111553242B (en) Training method for generating countermeasure network for predicting driving behavior and electronic device
CN110335466B (en) Traffic flow prediction method and apparatus
CN110930429A (en) Target tracking processing method, device and equipment and readable medium
CN113947022B (en) Near-end strategy optimization method based on model
CN115797394A (en) Multi-agent covering method based on reinforcement learning
CN113239629B (en) Method for reinforcement learning exploration and utilization of trajectory space determinant point process
KR102299135B1 (en) Method and device that providing deep-running-based baduk game service
CN114330933B (en) Execution method of meta-heuristic algorithm based on GPU parallel computation and electronic equipment
CN109409507A (en) Neural network construction method and equipment
CN114840024A (en) Unmanned aerial vehicle control decision method based on context memory
WO2022036567A1 (en) Target detection method and device, and vehicle-mounted radar
KR102299141B1 (en) Method and device for providing deep-running-based baduk game service

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant