CN114326749B - Deep Q-Learning-based cluster area coverage method - Google Patents
Abstract
The invention discloses a cluster area coverage method based on Deep Q-Learning, which comprises the following steps: establishing a dynamics model of the cluster system; determining the neighbor sets of the agents in the cluster; establishing a motion control model of the cluster system; constructing an information map and encoding it; defining the state space, behavior space and return function required by reinforcement learning according to the information map; designing the network model required by the Deep Q-Learning algorithm; designing a Deep Q-Learning area coverage algorithm for the free (obstacle-free) area; and adjusting the obtained guide points as required to obtain a Deep Q-Learning area coverage algorithm for the area with obstacles. By means of the Deep Q-Learning technique, the invention trains and learns a cluster area coverage control algorithm, realizes cluster area coverage in both free areas and areas with obstacles, effectively improves cluster area coverage efficiency, and ensures the stability of the algorithm in weak communication environments.
Description
Technical Field
The invention belongs to the fields of multi-agent clusters and Q-Learning, and particularly relates to a cluster area coverage method based on Deep Q-Learning.
Background
The idea of multi-agent clusters derives from observation and study of animal swarm motion in nature: for example, sharks drive fish shoals to the sea surface before preying on them, and wild geese reduce air resistance by maintaining a specific formation during migration, so multi-agent clustering is a biomimetic line of research. With the rise of artificial intelligence in recent years, intelligent control of robots, unmanned aerial vehicles, unmanned vehicles and the like has become a popular research field and has made significant progress.
Cluster area coverage has important application and scientific research value, such as exploration of unknown regions and monitoring of target regions. Existing cluster area coverage methods make poor use of historical coverage information, and the resulting repeated coverage greatly reduces algorithmic efficiency. Improving the efficiency of area coverage algorithms, so as to maximize the searched area in the shortest time, is therefore an important research direction in multi-agent cluster search control.
Deep Q-Learning is an algorithm that replaces the Q-value table of traditional reinforcement learning with a deep neural network in order to optimize decisions. During cluster area coverage in a complex environment, multiple agents can use the deep neural network to learn state and behavior characteristics and select strategies for planning guide points. Once the Deep Q-Learning algorithm has been trained, an optimal guide-point planning strategy is obtained, enabling the cluster to cover the target area rapidly.
Disclosure of Invention
The invention aims to overcome the defects of the prior art, and provides a cluster area coverage method based on Deep Q-Learning, which can realize the cluster area coverage under a free area and an obstacle area, effectively improve the cluster area coverage efficiency and ensure the stability of an algorithm under a weak communication environment.
The aim of the invention is realized by the following technical scheme: a cluster area coverage method based on Deep Q-Learning comprises the following steps:
Step S1, a dynamics model of the cluster system is established. A cluster V contains n agents, V = {1, 2, ..., n}; the i-th agent in the cluster is defined as agent i, and its second-order dynamics model is defined as follows:
where p_i is the position of agent i, v_i is the velocity of agent i, u_i is the acceleration input of agent i, n is the total number of agents in the cluster, and ṗ_i and v̇_i denote the derivatives of p_i and v_i with respect to time;
step S2, determining a neighbor set of the agents in the cluster, wherein when the distance between the two agents in the cluster is smaller than the communication distance, the two agents are considered to establish communication connection and share the position and the speed, and the neighbor set of the agent i is described as follows:
N_i = {j ∈ V : ||p_j − p_i|| ≤ r_α, j ≠ i}
where V represents the set of all agents, r_α denotes the communication distance between agents, ||·|| is the Euclidean norm, p_i is the position of agent i, and p_j is the position of agent j;
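The neighbor-set rule above can be sketched in a few lines; the following fragment is illustrative only (the agent positions, the radius value and the function name are made up for the example and are not part of the patent):

```python
import math

def neighbor_set(i, positions, r_alpha):
    """Return N_i = {j : ||p_j - p_i|| <= r_alpha, j != i} for agent i."""
    p_i = positions[i]
    return {
        j for j, p_j in enumerate(positions)
        if j != i and math.dist(p_i, p_j) <= r_alpha
    }

# Made-up example: four agents on a line, communication radius 1.5.
positions = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (5.0, 0.0)]
```

With this data, agent 0 has only agent 1 as a neighbor, while the isolated agent 3 has none, matching the definition of N_i.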
Step S3, a motion control model of the cluster system is established, where α-agent denotes an agent, β-agent denotes an obstacle detected by an agent, and γ-agent denotes the destination of the agent's motion. Control inputs u_i^α, u_i^β and u_i^γ are generated from the α-agent, β-agent and γ-agent respectively, and the total motion control input of agent i is calculated as u_i = u_i^α + u_i^β + u_i^γ, where:
u_i^α ensures that agents in the cluster do not collide with each other during motion;
u_i^β is the obstacle-avoidance control input when an agent moves in a space containing obstacles;
u_i^γ determines the direction of motion of the agent;
Step S4, the area to be traversed is quantized into a γ-information map of m × l cells, the center of each cell corresponding to one guide point γ; complete search of the area is thereby converted into complete traversal of the γ points in the information map. All γ points form the γ-information map set of agent i; agent i fuses and updates the information map according to its own information map set and those of its neighbor agents, obtains its γ-information map, and encodes the information map;
s5, defining a state space, a behavior space and a return function required by reinforcement learning according to the gamma-information map;
s6, designing a network model required by a Deep Q-Learning algorithm;
step S7, designing a Deep Q-Learning region coverage algorithm under the free region based on the results of the steps S5 and S6, determining a behavior selection strategy of the intelligent agent, continuously interacting with the environment through the behavior selection strategy and generating experience information, and training a Deep Q-Learning network by using the experience information;
and S8, designing a gamma point position adjustment strategy under the obstacle area, and adjusting the gamma point selected by the Deep Q-Learning network in the step S7 according to the requirement to obtain a Deep Q-Learning area coverage algorithm under the obstacle area.
In the above technical solution, in step 3:
u_i^α ensures that the agents in the cluster do not collide with each other during motion, and is defined as follows:
where c_α is a constant greater than zero; to ensure that the norm is differentiable everywhere, the σ-norm is defined as:
Differentiating the σ-norm gives:
n_ij is defined as follows:
φ_α(z) is a potential energy function defined as follows:
φ_α(z) = ρ_h(z/r_α) φ(z − d_α)
where r_α = ||r_a||_σ, with r_a the communication distance between agents, and d_α = ||d||_σ, with d the ideal spacing between agents;
φ(z) is defined as follows:
where the following constraints are satisfied between a, b and c:
ρ_h(z) is defined as follows:
ρ_h(z) is used in φ_α(z) to ensure the smoothness of the potential energy function; the potential energy is then obtained as an integral:
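The σ-norm and the smoothing function ρ_h named above are not reproduced in this text. The fragment below sketches the standard forms used in the flocking literature; the exact expressions, the parameter ε of the σ-norm and the value h = 0.2 are assumptions for illustration, since the patent's own equations are not shown here:

```python
import math

def sigma_norm(z, eps=0.1):
    """sigma-norm ||z||_sigma = (1/eps)*(sqrt(1 + eps*||z||^2) - 1);
    unlike the Euclidean norm, it is differentiable at z = 0."""
    n = math.sqrt(sum(c * c for c in z))
    return (math.sqrt(1.0 + eps * n * n) - 1.0) / eps

def rho_h(z, h=0.2):
    """Bump function: 1 on [0, h), smooth cosine roll-off on [h, 1], 0 beyond.
    Multiplying a potential by rho_h makes it vanish smoothly at the
    communication radius, which is the smoothing role described in the text."""
    if 0.0 <= z < h:
        return 1.0
    if h <= z <= 1.0:
        return 0.5 * (1.0 + math.cos(math.pi * (z - h) / (1.0 - h)))
    return 0.0
```

Note how rho_h decreases monotonically from 1 to 0 on [h, 1], so the resulting potential energy function has no jump at the interaction cutoff.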
a_ij(p) is an element of the inter-agent adjacency matrix, defined as follows:
When an agent moves in a space containing obstacles, its obstacle-avoidance control input u_i^β is defined as follows:
where c_β is a constant greater than zero; the set above is the set of obstacles detected by agent i; p_{i,k} and v_{i,k} denote the position and velocity of obstacle k detected by agent i; and the potential energy function φ_β(z) is defined as follows:
where d_β = ||d_o||_σ, with d_o the ideal obstacle-avoidance distance of the agent;
u_i^γ determines the direction of motion of the agent, and is given as follows:
where c_1^γ and c_2^γ are proportional and differential control gains greater than zero, and p_γ is the position of the guide point γ;
in the above technical solution, the step 4 includes:
The area to be traversed is a rectangular region of size M × L. The area is quantized into a γ-information map of m × l cells, the center of each cell corresponding to one guide point γ; complete search of the area is thereby converted into complete traversal of the γ points in the information map. All γ points form the γ-information map set of agent i:
m_i(γ) = {γ_{x,y}}, x = 1, 2, ..., m, y = 1, 2, ..., l
where m and l are obtained by:
where r_s denotes the sensing radius of agent i. If agent i has traversed the position of guide point γ, then m_i(γ) = 1 is recorded; otherwise m_i(γ) = 0;
And the agent i completes fusion updating of the information map according to the information map set of the agent i and the information map set of the neighbor agent, and an updating formula is defined as follows:
where m_i(γ_{x,y}) denotes the γ-information map of agent i, N_i is the set of neighbor agents of agent i, and m_s(γ_{x,y}) are the γ-information maps of the neighbor agents of agent i;
The γ-information map of the agent is encoded as follows: the information map is vectorized by columns, and binary values are taken out 8 at a time and hexadecimal-encoded; if fewer than 8 values remain at the end, the missing bits are padded with 0. After encoding, each group of 8 binary values corresponds to one hexadecimal-coded byte; when other agents receive the codes, they decode them by inverting the encoding process and restore the original information map;
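The column-wise vectorization, zero padding and hexadecimal coding described above can be sketched as follows; the bit order within each group of 8 and the function names are assumptions for illustration, since the patent does not fix such details here:

```python
def encode_map(info_map):
    """Column-major flatten of a 0/1 info map, grouped into bytes,
    each byte emitted as a two-digit hex string (bit order assumed MSB-first)."""
    m, l = len(info_map), len(info_map[0])
    bits = [info_map[x][y] for y in range(l) for x in range(m)]  # by columns
    while len(bits) % 8:                 # pad the last incomplete group with 0
        bits.append(0)
    out = []
    for k in range(0, len(bits), 8):
        byte = 0
        for b in bits[k:k + 8]:
            byte = (byte << 1) | b
        out.append(format(byte, "02x"))
    return out

def decode_map(codes, m, l):
    """Inverse of encode_map: restore the original m x l info map."""
    bits = []
    for c in codes:
        bits.extend((int(c, 16) >> (7 - k)) & 1 for k in range(8))
    grid = [[0] * l for _ in range(m)]
    for idx in range(m * l):             # undo the column-major flatten
        y, x = divmod(idx, m)
        grid[x][y] = bits[idx]
    return grid
```

A 3 × 3 map (9 bits) is padded to 16 bits and transmitted as two hex-coded bytes; decoding recovers the map exactly.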
in the above technical solution, the step 5 includes:
defining a state space required by reinforcement learning, and for the agent i, the state construction method comprises the following steps: firstly, fusing the information maps of the agent i and the neighbor agents according to the update formula of the gamma-information map in the step S4; secondly, assigning a weight value 3 to the gamma point position of the agent i in the information map, and assigning a weight value 2 to the gamma point position of all neighbor agents; finally, linearly stretching the fused information map into a gray map with gray values of 0 to 255, namely, 0 in the information map corresponds to gray value 0, and 3 in the information map corresponds to gray value 255;
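The linear stretch of the fused map (values 0 to 3) onto a grayscale image (values 0 to 255) amounts to one line; a small illustrative fragment (the rounding choice is an assumption):

```python
def to_gray(fused_map):
    """Stretch fused info-map values 0..3 linearly onto gray levels 0..255,
    so 0 maps to gray 0 and the agent's own weight 3 maps to gray 255."""
    return [[round(v * 255 / 3) for v in row] for row in fused_map]
```

For example, a fused cell holding the neighbor weight 2 becomes gray level 170.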
A behavior space required for reinforcement learning is defined; the behavior of an agent is the selection of a target γ point. The agent's current position in the γ-map and the 8 selectable γ points around it are numbered 1 to 9, and the behavior space of agent i is defined as follows:
A_i = {1, 2, 3, 4, 5, 6, 7, 8, 9}
If agent i is located at the edge of the information map, its behavior space is a subset of A_i. To accelerate training, only uncovered γ points are taken as selectable behaviors of the agent during training, generating a selectable behavior space A′_i defined as follows:
A′_i = {γ_{x,y} ∈ A_i | m_i(γ_{x,y}) = 0}
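A sketch of how A′_i can be computed from the local information map; the coordinate convention and function name are made up, and behaviors are returned as grid offsets rather than the 1–9 numbering used in the text:

```python
def selectable_behaviors(pos, info_map):
    """A'_i: the grid cells in the 3x3 neighborhood of pos (including pos
    itself) that lie inside the map and whose gamma point is uncovered
    (m_i = 0). Cells outside the map edge are dropped, so the result is a
    subset of the full 9-behavior space."""
    m, l = len(info_map), len(info_map[0])
    x, y = pos
    acts = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            nx, ny = x + dx, y + dy
            if 0 <= nx < m and 0 <= ny < l and info_map[nx][ny] == 0:
                acts.append((dx, dy))
    return acts
```

At a map corner only the in-bounds neighbors survive, illustrating the "subset of A_i" case for edge agents.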
According to the selected behavior, the agent takes the corresponding γ point as its target point, and the control input computed by the motion control model of step S3 drives the agent toward the target point. In practice, when ||p_i − γ_{x,y}|| < ε_d holds, the agent is judged to have reached the γ point, where ε_d is the allowable distance error;
defining a return function required by reinforcement learning:
where γ′_{x,y} is the next γ point selected by agent i, T is the time consumed to traverse the γ-information map, i.e., to complete the area coverage process, and R(T) is defined as follows:
where the coefficients in R(T) are all positive constants, r_ref is the maximum return, and T_min is the theoretical minimum coverage time of the area, defined as follows:
where M × L represents the size of the target traversal region, m × l represents the size of the γ-information map, and v_max is the maximum movement speed of the agent;
in the above technical solution, the step 6 includes:
A network model required by the Deep Q-Learning algorithm is designed. To avoid losing feature information during convolution, the convolution kernel size is set to 3 or 1, the stride of all convolution layers is set to 1, and the padding parameter is set so that the output feature size of each convolution layer equals that of the input image. To avoid losing image features during pooling, the network contains no pooling layer. Following these principles, the layers of the Q network are designed in order as follows: an input of dimension 8 × 8 × 1 with 3 × 3 convolution kernels; convolution layer 1 of dimension 8 × 8 × 32 with 3 × 3 kernels; convolution layer 2 of dimension 8 × 8 × 64 with 3 × 3 kernels; convolution layer 3 of dimension 8 × 8 × 128 with 3 × 3 kernels; convolution layer 4 of dimension 8 × 8 × 128 with 1 × 1 kernels; a fully connected layer of dimension 64 × 1; and an output layer of dimension 9 × 1;
the activation function selected in the training process is a ReLU function, the Loss function is Huber Loss, and the definition is as follows:
where s is the input state of the Q network, a is the behavior selected by the agent, Q(s, a) denotes the current-value network output, and Q(s′, a′) denotes the target-value network output;
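For reference, the Huber Loss named above has the standard piecewise form, quadratic inside a threshold δ and linear outside it; the sketch below applies it to a scalar temporal-difference error, with δ = 1 as an assumed threshold value:

```python
def huber(td_error, delta=1.0):
    """Huber loss on a TD error: 0.5*e^2 for |e| <= delta, linear beyond.
    The linear tail damps the gradient of outlier TD errors, which is why
    it is preferred over plain squared error for Q-network training."""
    e = abs(td_error)
    if e <= delta:
        return 0.5 * e * e
    return delta * (e - 0.5 * delta)
```

The loss is symmetric in the sign of the error and grows only linearly for large errors.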
In the above technical scheme, step 7.1: a Deep Q-Learning area coverage algorithm for the free area is designed based on the results of steps S5 and S6, and the behavior selection policy of the agent is determined. According to whether the selectable behavior space A′_i defined in step S5 is empty, the behavior selection of agent i is divided into two cases:
Case 1: A′_i is not empty. During training, so that all states can be visited, behaviors are selected by the ε-greedy method, determined by the following formula:
where rand(1) denotes a random number drawn from (0, 1), Q_i(s_i, a_i) is the output of the target-value network of agent i, the function f_sample denotes random sampling from the selectable behavior space A′_i, and ε is an exploration variable defined as follows:
where ε_start and ε_end denote the initial and final values of ε respectively, σ_ESP is the decay factor, episode_num is the training round (episode) counter, and ε is held constant at 0 during testing after training is complete;
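The decay of ε from ε_start to ε_end is commonly realized as an exponential schedule driven by the episode counter; the fragment below is a sketch under that assumption (all numeric values are made up, and sigma_esp plays the role of the attenuation factor σ_ESP):

```python
import math

def epsilon(episode_num, eps_start=0.9, eps_end=0.05, sigma_esp=200.0):
    """Exponentially anneal the exploration rate from eps_start toward
    eps_end as training rounds accumulate; larger sigma_esp decays slower."""
    return eps_end + (eps_start - eps_end) * math.exp(-episode_num / sigma_esp)
```

Early rounds explore almost uniformly (ε near ε_start); late rounds are nearly greedy (ε near ε_end), and at test time ε is simply forced to 0 as the text states.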
Case 2: A′_i is empty. Agent i cannot reach an uncovered area no matter which behavior it selects; in this case the uncovered γ point closest to the agent is chosen and the shortest path to it is taken. Behavior selection is defined as follows:
step 7.2, giving a Deep Q-Learning-based region coverage algorithm under a free region, enabling an intelligent agent to continuously interact with the environment through a behavior selection strategy and generate experience information, and training a Deep Q-Learning network by using the experience information;
the Deep Q-Learning based region coverage algorithm under the free region comprises the following steps:
Step 7.2.1: initialize the Deep Q-Learning parameters λ, σ_ESP, r_ref, ε_start, ε_end, the control parameters of the agents, the sensor parameters r_s and d, the information-map parameters m and n, the experience-pool capacity C_max, the batch parameter batch_size, and the network parameter update period N_TU;
Step 7.2.2: initialize the information map for all agents, initialize the current-value and target-value network models, and update the parameters of the target-value network model with the following steps:
Step 7.2.2.1: for each round episode = 1 → N_T, perform:
Step 7.2.2.1.1: initialize the position and velocity information of each agent;
Step 7.2.2.1.2: initialize the state s, behavior a and γ point of each agent;
Step 7.2.2.1.3: while the current round has not completed the coverage, execute:
Step 7.2.2.1.3.1: traverse all agents i = 1 → n:
Step 7.2.2.1.3.2: the agent updates u_i, v_i and p_i according to the motion control model;
Step 7.2.2.1.3.3: update the information map; calculate the obtained return r_i; construct the state s′_i and update the state s_i := s′_i;
Step 7.2.2.1.3.4: store the sample (s_i, a_i, s′_i, r_i) in the experience pool D;
Step 7.2.2.1.3.5: determine behavior a_i according to the behavior selection policy and convert a_i into the corresponding γ point;
Step 7.2.2.1.3.6: if the number of samples in the experience pool is greater than batch_size, randomly select batch_size samples (s_i, a_i, s′_i, r_i) from D and train the current-value network with them; otherwise go to step 7.2.2.1.3.1;
Step 7.2.2.2: if mod(episode, N_TU) = 0, copy the parameters of the current-value network to the target-value network;
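The experience pool D with capacity C_max and uniform mini-batch sampling used in the steps above can be sketched as a fixed-size buffer; the class name and all numeric values are illustrative only:

```python
import random
from collections import deque

class ReplayBuffer:
    """Experience pool D: fixed capacity, oldest samples evicted first."""
    def __init__(self, capacity):
        self.data = deque(maxlen=capacity)

    def store(self, s, a, s_next, r):
        # One transition (s_i, a_i, s'_i, r_i) per call.
        self.data.append((s, a, s_next, r))

    def sample(self, batch_size):
        # Uniform random mini-batch; the caller checks len(D) > batch_size.
        return random.sample(list(self.data), batch_size)

buf = ReplayBuffer(capacity=100)
for t in range(150):            # 150 inserts overflow the 100-slot pool
    buf.store(t, 0, t + 1, 0.0)
```

After overflowing, the buffer holds only the 100 most recent transitions, so training batches are drawn from recent experience while stale transitions age out.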
in the above technical solution, the step 8 includes:
Step 8.1: adjust the γ point selected by the Deep Q-Learning network in step 7 as required to obtain the Deep Q-Learning area coverage algorithm under the area with obstacles, and determine the γ-point position adjustment strategy under the obstacle area. If a γ point is covered by an obstacle, coverage of that point need not be considered, and m_i(γ_{x,y}) = 1 is recorded. If an obstacle approaches a γ point and the distance between them is smaller than the obstacle-avoidance distance d_o of the agent, the position of the γ point must be adjusted; the grid cell containing the γ point is taken as the optimization area of a new guide point γ_o, denoted M_obs, and the point γ_o is determined by the following equation:
where the set in the equation is the set of obstacles detected by agent i. If no feasible point exists, let m_i(γ) = 1; otherwise a point of maximum coverage is selected according to the following equation:
where D_1 is the obstacle region within the optimization area M_obs, D_2 is the unobstructed region within M_obs, γ′_{x,y} is a candidate coordinate that can replace the γ point, and the solution of the equation is the optimal point for replacing the γ point;
Area coverage in the obstacle area is similar to that in the free area: when the distance between an obstacle and a γ point is too small, the agent achieves maximum coverage of the area by adjusting the γ point position. Therefore, once the Deep Q-Learning model for the free area has been trained, no separate model needs to be trained for the obstacle area;
step 8.2, a Deep Q-Learning based region coverage algorithm under the obstructed region comprises the following steps:
Step 8.2.1: initialize the control parameters c_α, c_β, etc. of the agents, the sensor parameters r_s, d, etc., and the information-map parameters m, n, etc.;
Step 8.2.2: initialize the obstacle area and initialize the information map for all agents;
Step 8.2.3: initialize the position and velocity information of each agent;
Step 8.2.4: initialize the state s, behavior a and γ point of each agent;
Step 8.2.5: load the network model trained in the free area;
Step 8.2.6: while the current round has not completed the coverage, execute:
Step 8.2.6.1: traverse all agents i = 1 → n:
Step 8.2.6.2: the agent updates u_i, v_i and p_i according to the motion control model;
Step 8.2.6.3: update the information map; calculate the obtained return r_i; construct the state s′_i and update the state s_i := s′_i;
Step 8.2.6.4: determine behavior a_i according to the behavior selection policy and convert it into the corresponding γ point;
Step 8.2.6.5: calculate the minimum distance d_γo between the obstacles and the γ point;
Step 8.2.6.6: if d_γo < d_o, adjust the γ point position with the following steps; otherwise go to step 8.2.6.1;
Step 8.2.6.6.1: construct M_obs and calculate the new guide point γ_o;
Step 8.2.6.6.2: if a feasible replacement point exists, take the new point as the guide point; otherwise let m_i(γ) = 1.
Because the invention adopts the above technical scheme, it has the following beneficial effects:
the invention trains and learns the cluster area coverage control algorithm by means of Deep Q-Learning technology, realizes the cluster area coverage under the free area and the area with the obstacle, effectively improves the cluster area coverage efficiency, and can ensure the stability of the algorithm under the weak communication environment.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a schematic diagram of an observation state construction process;
Detailed Description
The technical solution of the present invention will be described in further detail with reference to the accompanying drawings, but the scope of the present invention is not limited to the following description.
As shown in fig. 1, a cluster area coverage method based on Deep Q-Learning comprises the following steps:
Step S1, a dynamics model of the cluster system is established. A cluster V contains n agents, V = {1, 2, ..., n}; the i-th agent in the cluster is defined as agent i, and its second-order dynamics model is defined as follows:
where p_i is the position of agent i, v_i is the velocity of agent i, u_i is the acceleration input of agent i, n is the total number of agents in the cluster, and ṗ_i and v̇_i denote the derivatives of p_i and v_i with respect to time;
step S2, determining a neighbor set of the agents in the cluster, wherein when the distance between the two agents in the cluster is smaller than the communication distance, the two agents are considered to establish communication connection and share the position and the speed, and the neighbor set of the agent i is described as follows:
N_i = {j ∈ V : ||p_j − p_i|| ≤ r_α, j ≠ i}
where V represents the set of all agents, r_α denotes the communication distance between agents, ||·|| is the Euclidean norm, p_i is the position of agent i, and p_j is the position of agent j;
Step S3, a motion control model of the cluster system is established, where α-agent denotes an agent, β-agent denotes an obstacle detected by an agent, and γ-agent denotes the destination of the agent's motion. Control inputs u_i^α, u_i^β and u_i^γ are generated from the α-agent, β-agent and γ-agent respectively, and the total motion control input of agent i is calculated as u_i = u_i^α + u_i^β + u_i^γ, where:
u_i^α ensures that agents in the cluster do not collide with each other during motion;
u_i^β is the obstacle-avoidance control input when an agent moves in a space containing obstacles;
u_i^γ determines the direction of motion of the agent;
s4, constructing an information map and encoding the information map;
step S5, defining a state space, a behavior space and a return function required by reinforcement learning according to the information map;
s6, designing a network model required by a Deep Q-Learning algorithm;
step S7, designing a Deep Q-Learning region coverage algorithm under the free region based on the results of the steps S5 and S6;
and S8, adjusting the gamma point obtained in the step S7 as required to obtain a Deep Q-Learning region coverage algorithm under the obstructed region.
In the above technical solution, in step 3:
u_i^α ensures that the agents in the cluster do not collide with each other during motion, and is defined as follows:
where c_α is a constant greater than zero; to ensure that the norm is differentiable everywhere, the σ-norm is defined as:
Differentiating the σ-norm gives:
n_ij is defined as follows:
φ_α(z) is a potential energy function defined as follows:
φ_α(z) = ρ_h(z/r_α) φ(z − d_α)
where r_α = ||r_a||_σ, with r_a the communication distance between agents, and d_α = ||d||_σ, with d the ideal spacing between agents;
φ(z) is defined as follows:
where the following constraints are satisfied between a, b and c:
ρ_h(z) is defined as follows:
ρ_h(z) is used in φ_α(z) to ensure the smoothness of the potential energy function; the potential energy is then obtained as an integral:
a_ij(p) is an element of the inter-agent adjacency matrix, defined as follows:
When an agent moves in a space containing obstacles, its obstacle-avoidance control input u_i^β is defined as follows:
where c_β is a constant greater than zero; the set above is the set of obstacles detected by agent i; p_{i,k} and v_{i,k} denote the position and velocity of obstacle k detected by agent i; and the potential energy function φ_β(z) is defined as follows:
where d_β = ||d_o||_σ, with d_o the ideal obstacle-avoidance distance of the agent;
u_i^γ determines the direction of motion of the agent, and is given as follows:
where c_1^γ and c_2^γ are proportional and differential control gains greater than zero, and p_γ is the position of the guide point γ;
in the above technical solution, the step 4 includes:
The area to be traversed is a rectangular region of size M × L. The area is quantized into a γ-information map of m × l cells, the center of each cell corresponding to one guide point γ; complete search of the area is thereby converted into complete traversal of the γ points in the information map. All γ points form the γ-information map set of agent i:
m_i(γ) = {γ_{x,y}}, x = 1, 2, ..., m, y = 1, 2, ..., l
where m and l are obtained by:
where r_s denotes the sensing radius of agent i. If agent i has traversed the position of guide point γ, then m_i(γ) = 1 is recorded; otherwise m_i(γ) = 0;
And the agent i completes fusion updating of the information map according to the information map set of the agent i and the information map set of the neighbor agent, and an updating formula is defined as follows:
where m_i(γ_{x,y}) denotes the γ-information map of agent i, N_i is the set of neighbor agents of agent i, and m_s(γ_{x,y}) are the γ-information maps of the neighbor agents of agent i;
The γ-information map of the agent is encoded as follows: the information map is vectorized by columns, and binary values are taken out 8 at a time and hexadecimal-encoded; if fewer than 8 values remain at the end, the missing bits are padded with 0. After encoding, each group of 8 binary values corresponds to one hexadecimal-coded byte; when other agents receive the codes, they decode them by inverting the encoding process and restore the original information map;
in the above technical solution, the step 5 includes:
as shown in fig. 2, a state space required for reinforcement learning is defined, and for agent i, the state construction method is as follows: firstly, fusing the information maps of the agent i and the neighbor agents according to the update formula of the gamma-information map in the step S4; secondly, assigning a weight value 3 to the gamma point position of the agent i in the information map, and assigning a weight value 2 to the gamma point position of all neighbor agents; finally, linearly stretching the fused information map into a gray map with gray values of 0 to 255, namely, 0 in the information map corresponds to gray value 0, and 3 in the information map corresponds to gray value 255;
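The three state-construction steps above (map fusion, weights 3 and 2, linear stretch to gray values 0 to 255) can be sketched as follows; representing positions as grid indices is an assumption:

```python
def build_state(fused_map, own_pos, neighbor_pos):
    """State construction sketch: assign weight 3 to the agent's own
    gamma point, weight 2 to the neighbors' gamma points, then linearly
    stretch the values 0..3 to gray levels 0..255."""
    m = [row[:] for row in fused_map]
    for (x, y) in neighbor_pos:
        m[x][y] = 2                 # neighbor agents' positions
    x, y = own_pos
    m[x][y] = 3                     # agent i's own position
    return [[round(v * 255 / 3) for v in row] for row in m]
```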
A behavior space required for reinforcement learning is defined, in which a behavior of the agent is the selection of a target γ point; the agent's current position in the γ-map and the 8 selectable γ points around it are numbered 1 to 9, and the behavior space of agent i is defined as follows:
A i ={1,2,3,4,5,6,7,8,9}
If agent i is located at the edge of the information map, its behavior space is a subset of A_i. To accelerate training, only uncovered γ points are taken as the agent's selectable behaviors during training, generating a selectable behavior space A'_i defined as follows:
A'_i = {γ_{x,y} ∈ A_i | m_i(γ_{x,y}) = 0}
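A minimal sketch of building A'_i, assuming the behaviors 1 to 9 are numbered row by row over the 3×3 neighborhood (the exact numbering is not specified in the text):

```python
def selectable_behaviors(info_map, pos):
    """A'_i: of the 9 behaviors (the agent's own cell and its 8
    neighbors, numbered 1..9 row by row -- an assumed ordering), keep
    only those whose gamma point lies inside the map and is still
    uncovered (m_i = 0). Returns {behavior: gamma-point}."""
    m, l = len(info_map), len(info_map[0])
    x0, y0 = pos
    actions = {}
    for a, (dx, dy) in enumerate(
            [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)], start=1):
        x, y = x0 + dx, y0 + dy
        if 0 <= x < m and 0 <= y < l and info_map[x][y] == 0:
            actions[a] = (x, y)
    return actions
```

At a map edge the result is automatically a subset of A_i, matching the text.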
According to the selected behavior, the agent takes the corresponding γ point as its target point, and the control quantity is transmitted to the agent through the motion control model of step S3 so that the agent moves to the target point. In practical application, when |p_i − γ_{x,y}| < ε_d is satisfied, the agent is deemed to have reached the γ point, where ε_d is the allowable distance error;
defining a return function required by reinforcement learning:
where γ'_{x,y} is the next γ point selected by agent i, T is the time consumed to traverse the γ-information map or complete the region coverage process, and R(T) is defined as follows:
wherein both constants are positive, r_ref is the maximum return, and T_min is the theoretical minimum coverage time of the area, which is defined as follows:
where M×L is the size of the target traversal region, m×l is the size of the γ-information map, and v_max is the maximum movement speed of the agent;
in the above technical solution, the step 6 includes:
A network model required by the Deep Q-Learning algorithm is designed. To avoid losing feature information during convolution, the convolution kernel size is set to 3 or 1, the stride of every convolutional layer is set to 1, and the padding parameter is set so that the output feature of each convolutional layer has the same size as the initial image. To avoid losing image features during pooling, the network contains no pooling layer. Following these principles, the layers of the Q network are designed in order as: input dimension 8×8×1; convolutional layer 1, output 8×8×32, kernel 3×3; convolutional layer 2, output 8×8×64, kernel 3×3; convolutional layer 3, output 8×8×128, kernel 3×3; convolutional layer 4, output 8×8×128, kernel 1×1; fully connected layer, 64×1; output layer, 9×1;
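The claim that padding keeps every feature map at the input size follows from the standard convolution output-size formula; a quick check, with `conv_out` as an illustrative helper:

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Standard convolution output-size formula:
    out = (in + 2*padding - kernel) // stride + 1."""
    return (size + 2 * padding - kernel) // stride + 1

# With stride 1, a 3x3 kernel needs padding 1 and a 1x1 kernel padding 0
# for the 8x8 feature size to be preserved through every layer.
assert conv_out(8, 3, stride=1, padding=1) == 8
assert conv_out(8, 1, stride=1, padding=0) == 8
assert conv_out(8, 3, stride=1, padding=0) == 6  # without padding the map shrinks
```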
the activation function selected in the training process is a ReLU function, the Loss function is Huber Loss, and the definition is as follows:
wherein s is the input state of the Q network, a is the behavior selected by the agent, Q (s, a) represents the current value network output, and Q (s ', a') represents the target value network output;
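A scalar sketch of the Huber loss applied to the TD error y − Q(s, a); the threshold δ = 1.0 is an assumed value, not one given in the text:

```python
def huber(td_error, delta=1.0):
    """Huber loss on a TD error: quadratic within |e| <= delta and
    linear outside it, so large TD errors do not produce exploding
    gradients. delta = 1.0 is an assumed threshold."""
    e = abs(td_error)
    if e <= delta:
        return 0.5 * e * e
    return delta * (e - 0.5 * delta)
```

In training, the TD error would be r + λ·max_a' Q(s', a') − Q(s, a), with Q(s', a') from the target value network.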
in the above technical solution, the step 7 includes:
Step 7.1: design the Deep Q-Learning region coverage algorithm under the free region based on the results of steps S5 and S6, and determine the behavior selection strategy of the agent. According to whether the selectable behavior space A'_i defined in step S5 is empty, the behavior selection of agent i can be divided into two cases:
Case one: when A'_i is not empty, in order to enable all states to be trained, the ε-greedy method is used to select behaviors, determined by the following formula:
where rand(1) is a random number drawn from (0, 1), Q_i(s_i, a_i) is the output of the target value network of agent i, the function f_sample denotes random sampling from the selectable behavior space A'_i, and ε is an exploration variable defined as follows:
where ε_start and ε_end represent the initial and final values of ε respectively, σ_ESP is a decay factor, and episode_num is the episode counter during training;
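A sketch of the ε schedule and the ε-greedy rule; the exponential decay law and all numeric defaults are assumptions, since the defining formula is not legible in the source:

```python
import math
import random

def epsilon(episode_num, eps_start=0.9, eps_end=0.05, sigma_esp=200.0):
    """Assumed exponential decay from eps_start toward eps_end with
    decay factor sigma_esp."""
    return eps_end + (eps_start - eps_end) * math.exp(-episode_num / sigma_esp)

def select_behavior(q_values, selectable, eps):
    """epsilon-greedy over the selectable behavior space A'_i: explore
    with probability eps (the f_sample draw), otherwise pick the
    selectable behavior with the highest Q value."""
    if random.random() < eps:
        return random.choice(list(selectable))          # f_sample
    return max(selectable, key=lambda a: q_values[a - 1])
```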
Case two: when A'_i is empty, agent i cannot reach an uncovered area no matter which behavior it selects. In this case the uncovered γ point closest to the agent is selected, along with a shortest path to it; behavior selection is defined as follows:
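The fallback in case two (nearest uncovered γ point) can be sketched as a brute-force search; straight-line Euclidean distance in grid units stands in for the shortest path here:

```python
def nearest_uncovered(info_map, pos, cell=1.0):
    """When A'_i is empty: pick the uncovered gamma point closest to
    the agent. Euclidean distance over grid indices (cell size
    assumed) approximates the shortest path; returns None if the
    whole map is covered."""
    best, best_d = None, float('inf')
    for x, row in enumerate(info_map):
        for y, v in enumerate(row):
            if v == 0:
                d = ((x - pos[0]) ** 2 + (y - pos[1]) ** 2) ** 0.5 * cell
                if d < best_d:
                    best, best_d = (x, y), d
    return best
```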
Step 7.2: a Deep Q-Learning based region coverage algorithm under the free region is given, in which the behavior a_i is determined according to the behavior selection strategy and converted into the corresponding γ point, and the parameters of the current value network are periodically copied to the target value network;
the Deep Q-Learning based region coverage algorithm under the free region comprises the following steps:
Step 7.2.1: initialize the Deep Q-Learning parameters λ, σ_ESP, r_ref, ε_start, ε_end, the control parameters of the agent, the sensor parameters r_s and d, the information map parameters m and n, the experience pool capacity C_max, and the batch parameter batch_size;
step 7.2.2, initializing an information map for all agents, initializing a current value and a target value network model, and updating parameters of the target value network model by adopting the following steps:
Step 7.2.2.1: traverse all episodes, episode = 1 → N_T, and perform:
step 7.2.2.1.1, initializing the position and speed information of each agent;
step 7.2.2.1.2, initializing the state s and behavior a of each agent;
Step 7.2.2.1.3: if the current round number is less than or equal to the total round number N_T, perform:
step 7.2.2.1.3.1, traversing all agents i=1→n:
Step 7.2.2.1.3.2: calculate u_i, v_i and p_i according to the motion control model of the agent;
Step 7.2.2.1.3.3: update the information map; calculate the obtained return r_i; construct state s'_i;
Step 7.2.2.1.3.4: store the sample data (s_i, a_i, s'_i, r_i) into the experience pool D;
Step 7.2.2.1.3.5: determine behavior a_i according to the behavior selection policy and convert a_i into the corresponding γ point; the agent moves to the γ point according to u_i, v_i and p_i, and the state is updated, s_i := s'_i, after the movement ends;
Step 7.2.2.1.3.6: if the number of samples in the experience pool is greater than batch_size, randomly select batch_size samples (s_i, a_i, s'_i, r_i) from D and train the current value network with them; otherwise return to step 7.2.2.1.3.2;
Step 7.2.2.2: if mod(episode, N_TU) = 0, copy the parameters of the current value network to the target value network, where N_TU is the update period;
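The experience-pool and target-network bookkeeping in the steps above amounts to standard DQN machinery; a sketch with illustrative names, where a plain dict stands in for network weights:

```python
import random
from collections import deque

class ReplayAndTarget:
    """Sketch of the bookkeeping above: a bounded experience pool D,
    minibatch sampling once the pool exceeds batch_size, and a periodic
    target-network sync every N_TU episodes. Class and method names are
    illustrative assumptions."""
    def __init__(self, c_max, batch_size, n_tu):
        self.pool = deque(maxlen=c_max)      # experience pool D, capacity C_max
        self.batch_size = batch_size
        self.n_tu = n_tu                     # update period N_TU

    def store(self, s, a, s_next, r):
        self.pool.append((s, a, s_next, r))  # sample (s_i, a_i, s'_i, r_i)

    def sample(self):
        if len(self.pool) <= self.batch_size:
            return None                      # keep collecting experience
        return random.sample(self.pool, self.batch_size)

    def maybe_sync(self, episode, current_params, target_params):
        if episode % self.n_tu == 0:         # mod(episode, N_TU) == 0
            target_params.update(current_params)
```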
in the above technical solution, the step 8 includes:
The γ point obtained in step S7 is adjusted as required to obtain the Deep Q-Learning region coverage algorithm under the obstacle region, and the γ point position adjustment method is determined as follows. If a γ point is covered by an obstacle, coverage of that point need not be considered, and m_i(γ_{x,y}) = 1 is recorded. If an obstacle approaches a γ point and their distance is smaller than the agent's obstacle avoidance distance d_o, the position of the γ point must be adjusted; the grid cell containing the γ point is taken as the search region for a new guide point γ_o and is denoted M_obs, and the point γ_o is determined by the following equation:
where the set of obstacles detected by agent i is used; if no feasible replacement point exists, let m_i(γ) = 1; otherwise a point of maximum coverage is selected according to the following equation:
where D_1 is the obstacle region within the optimization area M_obs, D_2 is the unobstructed region within M_obs, γ'_{x,y} is a candidate coordinate that can replace the γ point, and the selected point is the optimal replacement for the γ point;
Region coverage under the obstacle region is similar to that under the free region: when the distance between an obstacle and a γ point is too small, the agent can still achieve maximum coverage of the region by adjusting the position of the γ point. Therefore, once the Deep Q-Learning model has been trained under the free region, the model does not need to be retrained for the obstacle region;
based on the steps S1 to S8, a Deep Q-Learning based area coverage algorithm under the obstacle area is given as shown in table-2:
TABLE-2 Deep Q-Learning based region coverage algorithm under obstructed regions
The invention realizes the training and Learning of the cluster region coverage control algorithm by means of Deep Q-Learning technology, realizes the cluster region coverage under the free region and the obstacle region, effectively improves the cluster region coverage efficiency, and can ensure the stability of the algorithm under the weak communication environment.
The foregoing is a preferred embodiment of the invention. It should be understood that the invention is not limited to the form disclosed herein and is not to be construed as excluding other embodiments; it is capable of use in various other combinations, modifications, and environments, and changes within the scope of the inventive concept may be made in light of the above teachings or the skill or knowledge of the relevant art. Modifications and variations that do not depart from the spirit and scope of the invention are intended to fall within the scope of the appended claims.
Claims (7)
1. The cluster area coverage method based on Deep Q-Learning is characterized by comprising the following steps of:
step S1, a dynamics model of the cluster system is established: the cluster V contains n agents, V = {1, 2, ..., n}, the i-th agent in the cluster is defined as agent i, and its second-order dynamics model is defined as follows:
wherein p_i is the position of agent i, v_i is the velocity of agent i, u_i is the acceleration of agent i, n is the total number of agents in the cluster, and ṗ_i and v̇_i denote the derivatives of p_i and v_i with respect to time;
step S2, determining a neighbor set of the agents in the cluster, wherein when the distance between the two agents in the cluster is smaller than the communication distance, the two agents are considered to establish communication connection and share the position and the speed, and the neighbor set of the agent i is described as follows:
N_i = {j ∈ V : ||p_j − p_i|| ≤ r_α, j ≠ i}
wherein V represents the set of all agents, r_α denotes the communication distance between agents, ||·|| is the Euclidean norm, p_i is the position of agent i, and p_j is the position of agent j;
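The neighbor-set definition can be sketched directly; the 2-D coordinate representation and the function name are illustrative assumptions:

```python
def neighbor_set(positions, i, r_alpha):
    """N_i = {j in V : ||p_j - p_i|| <= r_alpha, j != i}, with agent
    positions given as 2-D coordinates and r_alpha the communication
    distance."""
    px, py = positions[i]
    return {j for j, (qx, qy) in enumerate(positions)
            if j != i and ((qx - px) ** 2 + (qy - py) ** 2) ** 0.5 <= r_alpha}
```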
step S3, a motion control model of the cluster system is established, wherein α-agent represents an agent, β-agent represents an obstacle detected by an agent, and γ-agent represents the destination of the agent's motion; control terms are generated from the α-agent, β-agent and γ-agent respectively, and the total motion control input of agent i is calculated as follows:
the first term ensures that agents in the cluster do not collide with each other during movement;
the second term is the obstacle-avoidance control quantity when the agent moves in a space containing obstacles;
the third term determines the movement direction of the agent;
step S4, the area to be traversed is quantized into an m×l γ-information map in which each quantized cell center corresponds to one guide point γ, converting the complete search of the area into a complete traversal of the γ points in the information map; all γ points form the γ-information map set of agent i, and agent i completes the fusion update of the information map from its own information map set and those of its neighbor agents to obtain its γ-information map, which it then encodes;
step S5, defining a state space, a behavior space and a return function required by reinforcement learning according to the γ-information map;
step S6, designing a network model required by the Deep Q-Learning algorithm;
step S7, designing a Deep Q-Learning region coverage algorithm under the free region based on the results of the steps S5 and S6, determining a behavior selection strategy of the intelligent agent, continuously interacting with the environment through the behavior selection strategy and generating experience information, and training a Deep Q-Learning network by using the experience information;
and step S8, designing a γ point position adjustment strategy under the obstacle area, and adjusting the γ point selected by the Deep Q-Learning network in step S7 as required to obtain a Deep Q-Learning area coverage algorithm under the obstacle area.
2. The Deep Q-Learning based cluster area coverage method according to claim 1, wherein: in step 3:
the first control term ensures that agents in the cluster do not collide with one another during movement, and is defined as follows:
wherein c_a is a constant greater than zero; to ensure the norm is differentiable everywhere, the σ-norm is defined:
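The σ-norm formula itself is an image in the source; the definition below is the standard one from the Olfati-Saber flocking literature that this notation follows, with ε = 0.1 as an assumed parameter value:

```python
import math

def sigma_norm(z, eps=0.1):
    """Sigma-norm used in Olfati-Saber-style flocking:
    ||z||_sigma = (sqrt(1 + eps * ||z||^2) - 1) / eps.
    Unlike the Euclidean norm it is differentiable everywhere,
    including at z = 0; eps = 0.1 is an assumed parameter."""
    n2 = sum(c * c for c in z)
    return (math.sqrt(1.0 + eps * n2) - 1.0) / eps
```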
differentiating the sigma norm to obtain:
n_ij is defined as follows:
φ_α(z) is a potential energy function defined as follows:
φ_α(z) = ρ_h(z/r_α) φ(z − d_α)
wherein r_α = ||r_a||_σ, with r_a representing the communication distance between agents, and d_α = ||d||_σ, with d representing the ideal spacing between agents;
φ(z) is defined as follows:
wherein the following constraints are satisfied between a, b, c:
ρ_h(z) is defined as follows:
the use of ρ_h(z) in φ_α(z) ensures the smoothness of the potential energy function; the potential energy is obtained by integration:
a_ij(p) is an element of the inter-agent adjacency matrix, defined as follows:
when the agent moves in a space containing obstacles, the obstacle-avoidance control quantity of agent i is defined as follows:
wherein c_β is a constant greater than zero, the set denotes the obstacles detected by agent i, p_{i,k} and v_{i,k} represent the position and velocity of obstacle k detected by agent i, and the potential energy function φ_β(z) is defined as follows:
wherein d_β = ||d_o||_σ, and d_o denotes the ideal obstacle-avoidance spacing of the agent;
the movement direction of the agent is determined as follows:
wherein the two gains are the proportional and differential control parameters, both greater than zero, and p_γ denotes the position of the guide point γ.
3. The Deep Q-Learning based cluster area coverage method according to claim 1, wherein: the step 4 comprises the following steps:
The area to be traversed is a rectangular M×L area, which is quantized into an m×l γ-information map; the center of each quantized cell corresponds to one guide point γ. The complete search of the area is thereby converted into a complete traversal of the γ points in the information map, and all γ points form the γ-information map set of agent i:
m_i(γ) = {γ_{x,y}}, x = 1, 2, ..., m, y = 1, 2, ..., l
wherein m and l are obtained by:
where r_s represents the sensing radius of agent i; if agent i has traversed the position of guide point γ, m_i(γ) = 1 is recorded, otherwise m_i(γ) = 0;
Agent i completes the fusion update of the information map from its own information map set and those of its neighbor agents; the update formula is defined as follows:
where m_i(γ_{x,y}) denotes the γ-information map of agent i, N_i is the set of neighbor agents of agent i, and m_s(γ_{x,y}) denotes the γ-information maps of the neighbor agents of agent i;
The γ-information map of the agent is encoded as follows: the information map is vectorized by columns, 8 binary values are taken out consecutively at a time and encoded in hexadecimal, and if fewer than 8 values remain at the end, the missing bits are padded with 0. After encoding, each group of 8 binary values corresponds to one hexadecimal number; when other agents receive the hexadecimal numbers, they decode them by inverting the encoding process and restore the original information map.
4. The Deep Q-Learning based cluster area coverage method according to claim 1, wherein: the step 5 comprises the following steps:
defining a state space required by reinforcement learning, and for the agent i, the state construction method comprises the following steps: firstly, fusing the information maps of the agent i and the neighbor agents according to the update formula of the gamma-information map in the step S4; secondly, assigning a weight value 3 to the gamma point position of the agent i in the information map, and assigning a weight value 2 to the gamma point position of all neighbor agents; finally, linearly stretching the fused information map into a gray map with gray values of 0 to 255, namely, 0 in the information map corresponds to gray value 0, and 3 in the information map corresponds to gray value 255;
A behavior space required for reinforcement learning is defined, in which a behavior of the agent is the selection of a target γ point; the agent's current position in the γ-map and the 8 selectable γ points around it are numbered 1 to 9, and the behavior space of agent i is defined as follows:
A i ={1,2,3,4,5,6,7,8,9}
If agent i is located at the edge of the information map, its behavior space is a subset of A_i. To accelerate training, only uncovered γ points are taken as the agent's selectable behaviors during training, generating a selectable behavior space A'_i defined as follows:
A'_i = {γ_{x,y} ∈ A_i | m_i(γ_{x,y}) = 0}
According to the selected behavior, the agent takes the corresponding γ point as its target point, and the control quantity is transmitted to the agent through the motion control model of step S3 so that the agent moves to the target point. In practical application, when |p_i − γ_{x,y}| < ε_d is satisfied, the agent is deemed to have reached the γ point, where ε_d is the allowable distance error;
defining a return function required by reinforcement learning:
where γ'_{x,y} is the next γ point selected by agent i, T is the time consumed to traverse the γ-information map or complete the region coverage process, and R(T) is defined as follows:
wherein both constants are positive, r_ref is the maximum return, and T_min is the theoretical minimum coverage time of the area, which is defined as follows:
where M×L is the size of the target traversal region, m×l is the size of the γ-information map, and v_max is the maximum movement speed of the agent.
5. The Deep Q-Learning based cluster area coverage method according to claim 1, wherein: the step 6 comprises the following steps:
A network model required by the Deep Q-Learning algorithm is designed. To avoid losing feature information during convolution, the convolution kernel size is set to 3 or 1, the stride of every convolutional layer is set to 1, and the padding parameter is set so that the output feature of each convolutional layer has the same size as the initial image. To avoid losing image features during pooling, the network contains no pooling layer. Following these principles, the layers of the Q network are designed in order as: input dimension 8×8×1; convolutional layer 1, output 8×8×32, kernel 3×3; convolutional layer 2, output 8×8×64, kernel 3×3; convolutional layer 3, output 8×8×128, kernel 3×3; convolutional layer 4, output 8×8×128, kernel 1×1; fully connected layer, 64×1; output layer, 9×1;
the activation function selected in the training process is a ReLU function, the Loss function is Huber Loss, and the definition is as follows:
where s is the input state of the Q network, a is the behavior selected by the agent, Q (s, a) represents the current value network output, and Q (s ', a') represents the target value network output.
6. The Deep Q-Learning based cluster area coverage method according to claim 1, wherein:
step 7.1, design the Deep Q-Learning region coverage algorithm under the free region based on the results of steps S5 and S6, and determine the behavior selection strategy of the agent: according to whether the selectable behavior space A'_i defined in step S5 is empty, the behavior selection of agent i can be divided into two cases:
Case one: when A'_i is not empty, in order to enable all states to be trained during training, the ε-greedy method is adopted to select behaviors, and behavior selection is determined by the following formula:
where rand(1) is a random number drawn from (0, 1), Q_i(s_i, a_i) is the output of the target value network of agent i, the function f_sample denotes random sampling from the selectable behavior space A'_i, and ε is an exploration variable defined as follows:
where ε_start and ε_end represent the initial and final values of ε respectively, σ_ESP is a decay factor, episode_num is the episode counter during training, and ε is held constant at 0 during testing after training is completed;
Case two: when A'_i is empty, agent i cannot reach an uncovered area no matter which behavior it selects. In this case the uncovered γ point closest to the agent is selected, along with a shortest path to it; behavior selection is defined as follows:
step 7.2, giving a Deep Q-Learning-based region coverage algorithm under a free region, enabling an intelligent agent to continuously interact with the environment through a behavior selection strategy and generate experience information, and training a Deep Q-Learning network by using the experience information;
the Deep Q-Learning based region coverage algorithm under the free region comprises the following steps:
step 7.2.1, initialize the Deep Q-Learning parameters λ, σ_ESP, r_ref, ε_start, ε_end, the agent control parameter c_a, the sensor parameters r_s and d, the information map parameters m and n, the experience pool capacity C_max, the batch parameter batch_size, and the network parameter update period N_TU;
Step 7.2.2, initializing an information map for all agents, initializing a current value and a target value network model, and updating parameters of the target value network model by adopting the following steps:
step 7.2.2.1, traverse all episodes, episode = 1 → N_T, and perform:
step 7.2.2.1.1, initializing the position and speed information of each agent;
step 7.2.2.1.2, initializing the state s, behavior a and gamma points of each agent;
step 7.2.2.1.3, if the current round does not complete the coverage, executing:
step 7.2.2.1.3.1, traversing all agents i=1→n:
step 7.2.2.1.3.2, the agent updates u_i, v_i and p_i according to the motion control model;
step 7.2.2.1.3.3, update the information map; calculate the obtained return r_i; construct state s'_i and update the state s_i := s'_i;
step 7.2.2.1.3.4, store the sample data (s_i, a_i, s'_i, r_i) into the experience pool D;
step 7.2.2.1.3.5, determine behavior a_i according to the behavior selection policy and convert a_i into the corresponding γ point;
step 7.2.2.1.3.6, if the number of samples in the experience pool is greater than batch_size, randomly select batch_size samples (s_i, a_i, s'_i, r_i) from D and train the current value network with them; otherwise return to step 7.2.2.1.3.1;
step 7.2.2.2, if mod(episode, N_TU) = 0, copy the parameters of the current value network to the target value network.
7. The Deep Q-Learning based cluster area coverage method according to claim 1, wherein: the step 8 includes:
step 8.1, adjust the γ point selected by the Deep Q-Learning network in step S7 as required to obtain the Deep Q-Learning region coverage algorithm under the obstacle region, and determine the γ point position adjustment strategy under the obstacle region: if a γ point is covered by an obstacle, coverage of that point need not be considered, and m_i(γ_{x,y}) = 1 is recorded; if an obstacle approaches a γ point and their distance is smaller than the agent's obstacle avoidance distance d_o, the position of the γ point must be adjusted, the grid cell containing the γ point is taken as the search region for a new guide point γ_o and denoted M_obs, and the point γ_o is determined by the following equation:
where the set of obstacles detected by agent i is used; if no feasible replacement point exists, let m_i(γ) = 1; otherwise a point of maximum coverage is selected according to the following equation:
where D_1 is the obstacle region within the optimization area M_obs, D_2 is the unobstructed region within M_obs, γ'_{x,y} is a candidate coordinate that can replace the γ point, and the selected point is the optimal replacement for the γ point;
Region coverage under the obstacle region is similar to that under the free region: when the distance between an obstacle and a γ point is too small, the agent can still achieve maximum coverage of the region by adjusting the position of the γ point. Therefore, once the Deep Q-Learning model has been trained under the free region, the model does not need to be retrained for the obstacle region;
step 8.2, a Deep Q-Learning based region coverage algorithm under the obstructed region comprises the following steps:
step 8.2.1, initialize the agent control parameters c_a, c_β, etc., the sensor parameters r_s, d, etc., and the information map parameters m, n, etc.;
step 8.2.2, initializing an obstacle area, and initializing an information map for all agents;
step 8.2.3, initializing the position and speed information of each agent;
step 8.2.4, initializing the state s and the behavior a and gamma points of each agent;
step 8.2.5, loading a network model which is trained and completed under a free area;
step 8.2.6, if the current round does not complete the coverage, executing:
step 8.2.6.1, traversing all agents i=1→n:
step 8.2.6.2, the agent updates u_i, v_i and p_i according to the motion control model;
step 8.2.6.3, update the information map; calculate the obtained return r_i; construct state s'_i and update the state s_i := s'_i;
Step 8.2.6.4, determining behavior a according to the behavior selection policy i Converting into a corresponding gamma point;
step 8.2.6.5, calculate the minimum obstacle-avoidance distance d_γo between the obstacle and the γ point;
step 8.2.6.6, if d_γo < d_o, adjust the γ point position using the following steps; otherwise execute step 8.2.6.1;
step 8.2.6.6.1, construct M_obs and calculate γ_o, where γ_o is the new guide point;
step 8.2.6.6.2, if a feasible replacement point exists, obtain the new replacement point; otherwise let m_i(γ) = 1.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210026133.0A CN114326749B (en) | 2022-01-11 | 2022-01-11 | Deep Q-Learning-based cluster area coverage method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114326749A CN114326749A (en) | 2022-04-12 |
CN114326749B true CN114326749B (en) | 2023-10-13 |
Family
ID=81026417
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210026133.0A Active CN114326749B (en) | 2022-01-11 | 2022-01-11 | Deep Q-Learning-based cluster area coverage method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114326749B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2006072477A (en) * | 2004-08-31 | 2006-03-16 | Nippon Telegr & Teleph Corp <Ntt> | Dialogue strategy learning method, program, and device, and storage medium |
CN111880565A (en) * | 2020-07-22 | 2020-11-03 | 电子科技大学 | Q-Learning-based cluster cooperative countermeasure method |
CN111880564A (en) * | 2020-07-22 | 2020-11-03 | 电子科技大学 | Multi-agent area searching method based on collaborative reinforcement learning |
CN113110478A (en) * | 2021-04-27 | 2021-07-13 | 广东工业大学 | Method, system and storage medium for multi-robot motion planning |
CN113156954A (en) * | 2021-04-25 | 2021-07-23 | 电子科技大学 | Multi-agent cluster obstacle avoidance method based on reinforcement learning |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
GB2557674B (en) * | 2016-12-15 | 2021-04-21 | Samsung Electronics Co Ltd | Automated Computer Power Management System, Apparatus and Methods |
- 2022-01-11: application CN202210026133.0A filed; patent CN114326749B granted, status active
Non-Patent Citations (1)
Title |
---|
Robot end-to-end control method based on deep Q-network learning; Zhang Haojie; Su Zhibao; Su Bo; Chinese Journal of Scientific Instrument (No. 10); full text *
Also Published As
Publication number | Publication date |
---|---|
CN114326749A (en) | 2022-04-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Lamini et al. | Genetic algorithm based approach for autonomous mobile robot path planning | |
CN110991972B (en) | Cargo transportation system based on multi-agent reinforcement learning | |
CN110347155B (en) | Intelligent vehicle automatic driving control method and system | |
CN112819253A (en) | Unmanned aerial vehicle obstacle avoidance and path planning device and method | |
Hagras et al. | Learning and adaptation of an intelligent mobile robot navigator operating in unstructured environment based on a novel online Fuzzy–Genetic system | |
CN111260026B (en) | Navigation migration method based on meta reinforcement learning | |
CN111880564A (en) | Multi-agent area searching method based on collaborative reinforcement learning | |
CN113299084B (en) | Regional signal lamp cooperative control method based on multi-view coding migration reinforcement learning | |
CN116382267B (en) | Robot dynamic obstacle avoidance method based on multi-mode pulse neural network | |
CN109540163A (en) | Obstacle-avoidance route planning algorithm combining differential evolution and fuzzy control | |
Janikow | A genetic algorithm method for optimizing the fuzzy component of a fuzzy decision tree | |
Showalter et al. | Neuromodulated multiobjective evolutionary neurocontrollers without speciation | |
Thabet et al. | Sample-efficient deep reinforcement learning with imaginary rollouts for human-robot interaction | |
CN113110052A (en) | Hybrid energy management method based on neural network and reinforcement learning | |
CN114326749B (en) | Deep Q-Learning-based cluster area coverage method | |
CN116080688B (en) | Brain-inspired intelligent driving vision assistance method, device and storage medium | |
Lee et al. | A genetic algorithm based robust learning credit assignment cerebellar model articulation controller | |
Showalter et al. | Lamarckian inheritance in neuromodulated multiobjective evolutionary neurocontrollers | |
CN116300755A (en) | Double-layer optimal scheduling method and device for heat storage-containing heating system based on MPC | |
CN109978133A (en) | Reinforcement learning transfer method based on action modes | |
CN114859719A (en) | Graph neural network-based reinforcement learning cluster flocking control method | |
Showalter et al. | Objective comparison and selection in mono-and multi-objective evolutionary neurocontrollers | |
Barto | An approach to learning control surfaces by connectionist systems | |
Butz et al. | REPRISE: A Retrospective and Prospective Inference Scheme. | |
CN115202339B (en) | DQN-based adaptive planning method for multiple lunar rovers sampling fixed targets | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||