CN114326749B - Deep Q-Learning-based cluster area coverage method - Google Patents
Abstract
The invention discloses a cluster area coverage method based on Deep Q-Learning, which comprises the following steps: establishing a dynamics model of the cluster system; determining the neighbor sets of the agents in the cluster; establishing a motion control model of the cluster system; constructing an information map and encoding it; defining the state space, behavior space and return function required by reinforcement learning according to the information map; designing the network model required by the Deep Q-Learning algorithm; designing a Deep Q-Learning area coverage algorithm for the free (obstacle-free) area; and adjusting the obtained guide points as required to obtain a Deep Q-Learning area coverage algorithm for the area with obstacles. By means of the Deep Q-Learning technique, the invention trains and learns a cluster area coverage control algorithm, realizes cluster area coverage in both free areas and areas with obstacles, effectively improves cluster area coverage efficiency, and ensures the stability of the algorithm in weak communication environments.
Description
Technical Field
The invention belongs to the fields of multi-agent clusters and Q-Learning, and particularly relates to a cluster area coverage method based on Deep Q-Learning.
Background
The idea of multi-agent clusters derives from observation and study of animal swarm motion in nature: for example, sharks drive fish shoals to the sea surface before preying on them, and wild geese reduce air resistance by maintaining a specific formation during migration, so multi-agent clustering is a biomimetic line of research. With the rise of artificial intelligence in recent years, intelligent control of robots, unmanned aerial vehicles, unmanned vehicles and the like has become a popular research field and has made significant progress.
Cluster area coverage has important application and scientific research value, such as exploration of unknown regions and monitoring of target regions. Existing cluster area coverage methods make poor use of historical coverage information, and the resulting repeated coverage greatly reduces algorithmic efficiency. Improving the efficiency of area coverage algorithms, so as to maximize the searched area in the shortest time, is therefore an important research direction in multi-agent cluster search control.
Deep Q-Learning is an algorithm that replaces the Q-value table of traditional reinforcement learning with a deep neural network in order to optimize decisions. During cluster area coverage in a complex environment, multiple agents can use the deep neural network to learn state and behavior characteristics and select strategies for planning guide points. Once the Deep Q-Learning algorithm has been trained, an optimal guide-point planning strategy is obtained, enabling the cluster to cover the target area rapidly.
Disclosure of Invention
The invention aims to overcome the defects of the prior art, and provides a cluster area coverage method based on Deep Q-Learning, which can realize the cluster area coverage under a free area and an obstacle area, effectively improve the cluster area coverage efficiency and ensure the stability of an algorithm under a weak communication environment.
The aim of the invention is realized by the following technical scheme: a cluster area coverage method based on Deep Q-Learning comprises the following steps:
Step S1, a dynamics model of the cluster system is established. A cluster V contains n agents, V = {1, 2, ..., n}; the i-th agent in the cluster is defined as agent i, and its second-order dynamics model is defined as follows:
where p_i is the position of agent i, v_i is the velocity of agent i, u_i is the acceleration input of agent i, n is the total number of agents in the cluster, and ṗ_i and v̇_i denote the derivatives of p_i and v_i with respect to time;
step S2, determining a neighbor set of the agents in the cluster, wherein when the distance between the two agents in the cluster is smaller than the communication distance, the two agents are considered to establish communication connection and share the position and the speed, and the neighbor set of the agent i is described as follows:
N_i = {j ∈ V : ||p_j − p_i|| ≤ r_α, j ≠ i}
where V represents the set of all agents, r_α denotes the communication distance between agents, ||·|| is the Euclidean norm, p_i is the position of agent i, and p_j is the position of agent j;
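The neighbor-set rule above can be sketched in a few lines; the following fragment is illustrative only (the agent positions, the radius value and the function name are made up for the example and are not part of the patent):

```python
import math

def neighbor_set(i, positions, r_alpha):
    """Return N_i = {j : ||p_j - p_i|| <= r_alpha, j != i} for agent i."""
    p_i = positions[i]
    return {
        j for j, p_j in enumerate(positions)
        if j != i and math.dist(p_i, p_j) <= r_alpha
    }

# Made-up example: four agents on a line, communication radius 1.5.
positions = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (5.0, 0.0)]
```

With this data, agent 0 has only agent 1 as a neighbor, while the isolated agent 3 has none, matching the definition of N_i.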
Step S3, a motion control model of the cluster system is established, where α-agent denotes an agent, β-agent denotes an obstacle detected by an agent, and γ-agent denotes the destination of the agent's motion. Control inputs u_i^α, u_i^β and u_i^γ are generated from the α-agent, β-agent and γ-agent respectively, and the total motion control input of agent i is calculated as u_i = u_i^α + u_i^β + u_i^γ, where:
u_i^α ensures that agents in the cluster do not collide with each other during motion;
u_i^β is the obstacle-avoidance control input when an agent moves in a space containing obstacles;
u_i^γ determines the direction of motion of the agent;
Step S4, the area to be traversed is quantized into a γ-information map of m × l cells, the center of each cell corresponding to one guide point γ; complete search of the area is thereby converted into complete traversal of the γ points in the information map. All γ points form the γ-information map set of agent i; agent i fuses and updates the information map according to its own information map set and those of its neighbor agents, obtains its γ-information map, and encodes the information map;
s5, defining a state space, a behavior space and a return function required by reinforcement learning according to the gamma-information map;
s6, designing a network model required by a Deep Q-Learning algorithm;
step S7, designing a Deep Q-Learning region coverage algorithm under the free region based on the results of the steps S5 and S6, determining a behavior selection strategy of the intelligent agent, continuously interacting with the environment through the behavior selection strategy and generating experience information, and training a Deep Q-Learning network by using the experience information;
and S8, designing a gamma point position adjustment strategy under the obstacle area, and adjusting the gamma point selected by the Deep Q-Learning network in the step S7 according to the requirement to obtain a Deep Q-Learning area coverage algorithm under the obstacle area.
In the above technical solution, in step 3:
u_i^α ensures that the agents in the cluster do not collide with each other during motion, and is defined as follows:
where c_α is a constant greater than zero; to ensure that the norm is differentiable everywhere, the σ-norm is defined as:
Differentiating the σ-norm gives:
n_ij is defined as follows:
φ_α(z) is a potential energy function defined as follows:
φ_α(z) = ρ_h(z/r_α) φ(z − d_α)
where r_α = ||r_a||_σ, with r_a the communication distance between agents, and d_α = ||d||_σ, with d the ideal spacing between agents;
φ(z) is defined as follows:
where the following constraints are satisfied between a, b and c:
ρ_h(z) is defined as follows:
ρ_h(z) is used in φ_α(z) to ensure the smoothness of the potential energy function; the potential energy is then obtained as an integral:
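The σ-norm and the smoothing function ρ_h named above are not reproduced in this text. The fragment below sketches the standard forms used in the flocking literature; the exact expressions, the parameter ε of the σ-norm and the value h = 0.2 are assumptions for illustration, since the patent's own equations are not shown here:

```python
import math

def sigma_norm(z, eps=0.1):
    """sigma-norm ||z||_sigma = (1/eps)*(sqrt(1 + eps*||z||^2) - 1);
    unlike the Euclidean norm, it is differentiable at z = 0."""
    n = math.sqrt(sum(c * c for c in z))
    return (math.sqrt(1.0 + eps * n * n) - 1.0) / eps

def rho_h(z, h=0.2):
    """Bump function: 1 on [0, h), smooth cosine roll-off on [h, 1], 0 beyond.
    Multiplying a potential by rho_h makes it vanish smoothly at the
    communication radius, which is the smoothing role described in the text."""
    if 0.0 <= z < h:
        return 1.0
    if h <= z <= 1.0:
        return 0.5 * (1.0 + math.cos(math.pi * (z - h) / (1.0 - h)))
    return 0.0
```

Note how rho_h decreases monotonically from 1 to 0 on [h, 1], so the resulting potential energy function has no jump at the interaction cutoff.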
a_ij(p) is an element of the inter-agent adjacency matrix, defined as follows:
When an agent moves in a space containing obstacles, its obstacle-avoidance control input u_i^β is defined as follows:
where c_β is a constant greater than zero; the set above is the set of obstacles detected by agent i; p_{i,k} and v_{i,k} denote the position and velocity of obstacle k detected by agent i; and the potential energy function φ_β(z) is defined as follows:
where d_β = ||d_o||_σ, with d_o the ideal obstacle-avoidance distance of the agent;
u_i^γ determines the direction of motion of the agent, and is given as follows:
where c_1^γ and c_2^γ are proportional and differential control gains greater than zero, and p_γ is the position of the guide point γ;
in the above technical solution, the step 4 includes:
The area to be traversed is a rectangular region of size M × L. The area is quantized into a γ-information map of m × l cells, the center of each cell corresponding to one guide point γ; complete search of the area is thereby converted into complete traversal of the γ points in the information map. All γ points form the γ-information map set of agent i:
m_i(γ) = {γ_{x,y}}, x = 1, 2, ..., m, y = 1, 2, ..., l
where m and l are obtained by:
where r_s denotes the sensing radius of agent i. If agent i has traversed the position of guide point γ, then m_i(γ) = 1 is recorded; otherwise m_i(γ) = 0;
And the agent i completes fusion updating of the information map according to the information map set of the agent i and the information map set of the neighbor agent, and an updating formula is defined as follows:
where m_i(γ_{x,y}) denotes the γ-information map of agent i, N_i is the set of neighbor agents of agent i, and m_s(γ_{x,y}) are the γ-information maps of the neighbor agents of agent i;
The γ-information map of the agent is encoded as follows: the information map is vectorized by columns, and binary values are taken out 8 at a time and hexadecimal-encoded; if fewer than 8 values remain at the end, the missing bits are padded with 0. After encoding, each group of 8 binary values corresponds to one hexadecimal-coded byte; when other agents receive the codes, they decode them by inverting the encoding process and restore the original information map;
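The column-wise vectorization, zero padding and hexadecimal coding described above can be sketched as follows; the bit order within each group of 8 and the function names are assumptions for illustration, since the patent does not fix such details here:

```python
def encode_map(info_map):
    """Column-major flatten of a 0/1 info map, grouped into bytes,
    each byte emitted as a two-digit hex string (bit order assumed MSB-first)."""
    m, l = len(info_map), len(info_map[0])
    bits = [info_map[x][y] for y in range(l) for x in range(m)]  # by columns
    while len(bits) % 8:                 # pad the last incomplete group with 0
        bits.append(0)
    out = []
    for k in range(0, len(bits), 8):
        byte = 0
        for b in bits[k:k + 8]:
            byte = (byte << 1) | b
        out.append(format(byte, "02x"))
    return out

def decode_map(codes, m, l):
    """Inverse of encode_map: restore the original m x l info map."""
    bits = []
    for c in codes:
        bits.extend((int(c, 16) >> (7 - k)) & 1 for k in range(8))
    grid = [[0] * l for _ in range(m)]
    for idx in range(m * l):             # undo the column-major flatten
        y, x = divmod(idx, m)
        grid[x][y] = bits[idx]
    return grid
```

A 3 × 3 map (9 bits) is padded to 16 bits and transmitted as two hex-coded bytes; decoding recovers the map exactly.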
in the above technical solution, the step 5 includes:
defining a state space required by reinforcement learning, and for the agent i, the state construction method comprises the following steps: firstly, fusing the information maps of the agent i and the neighbor agents according to the update formula of the gamma-information map in the step S4; secondly, assigning a weight value 3 to the gamma point position of the agent i in the information map, and assigning a weight value 2 to the gamma point position of all neighbor agents; finally, linearly stretching the fused information map into a gray map with gray values of 0 to 255, namely, 0 in the information map corresponds to gray value 0, and 3 in the information map corresponds to gray value 255;
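The linear stretch of the fused map (values 0 to 3) onto a grayscale image (values 0 to 255) amounts to one line; a small illustrative fragment (the rounding choice is an assumption):

```python
def to_gray(fused_map):
    """Stretch fused info-map values 0..3 linearly onto gray levels 0..255,
    so 0 maps to gray 0 and the agent's own weight 3 maps to gray 255."""
    return [[round(v * 255 / 3) for v in row] for row in fused_map]
```

For example, a fused cell holding the neighbor weight 2 becomes gray level 170.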
A behavior space required for reinforcement learning is defined; the behavior of an agent is the selection of a target γ point. The agent's current position in the γ-map and the 8 selectable γ points around it are numbered 1 to 9, and the behavior space of agent i is defined as follows:
A_i = {1, 2, 3, 4, 5, 6, 7, 8, 9}
If agent i is located at the edge of the information map, its behavior space is a subset of A_i. To accelerate training, only uncovered γ points are taken as selectable behaviors of the agent during training, generating a selectable behavior space A′_i defined as follows:
A′_i = {γ_{x,y} ∈ A_i | m_i(γ_{x,y}) = 0}
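A sketch of how A′_i can be computed from the local information map; the coordinate convention and function name are made up, and behaviors are returned as grid offsets rather than the 1–9 numbering used in the text:

```python
def selectable_behaviors(pos, info_map):
    """A'_i: the grid cells in the 3x3 neighborhood of pos (including pos
    itself) that lie inside the map and whose gamma point is uncovered
    (m_i = 0). Cells outside the map edge are dropped, so the result is a
    subset of the full 9-behavior space."""
    m, l = len(info_map), len(info_map[0])
    x, y = pos
    acts = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            nx, ny = x + dx, y + dy
            if 0 <= nx < m and 0 <= ny < l and info_map[nx][ny] == 0:
                acts.append((dx, dy))
    return acts
```

At a map corner only the in-bounds neighbors survive, illustrating the "subset of A_i" case for edge agents.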
According to the selected behavior, the agent takes the corresponding γ point as its target point, and the control input computed by the motion control model of step S3 drives the agent toward the target point. In practice, when ||p_i − γ_{x,y}|| < ε_d holds, the agent is judged to have reached the γ point, where ε_d is the allowable distance error;
defining a return function required by reinforcement learning:
where γ′_{x,y} is the next γ point selected by agent i, T is the time consumed to traverse the γ-information map, i.e., to complete the area coverage process, and R(T) is defined as follows:
where the coefficients in R(T) are all positive constants, r_ref is the maximum return, and T_min is the theoretical minimum coverage time of the area, defined as follows:
where M × L represents the size of the target traversal region, m × l represents the size of the γ-information map, and v_max is the maximum movement speed of the agent;
in the above technical solution, the step 6 includes:
A network model required by the Deep Q-Learning algorithm is designed. To avoid losing feature information during convolution, the convolution kernel size is set to 3 or 1, the stride of all convolution layers is set to 1, and the padding parameter is set so that the output feature size of each convolution layer equals that of the input image. To avoid losing image features during pooling, the network contains no pooling layer. Following these principles, the layers of the Q network are designed in order as follows: an input of dimension 8 × 8 × 1 with 3 × 3 convolution kernels; convolution layer 1 of dimension 8 × 8 × 32 with 3 × 3 kernels; convolution layer 2 of dimension 8 × 8 × 64 with 3 × 3 kernels; convolution layer 3 of dimension 8 × 8 × 128 with 3 × 3 kernels; convolution layer 4 of dimension 8 × 8 × 128 with 1 × 1 kernels; a fully connected layer of dimension 64 × 1; and an output layer of dimension 9 × 1;
the activation function selected in the training process is a ReLU function, the Loss function is Huber Loss, and the definition is as follows:
where s is the input state of the Q network, a is the behavior selected by the agent, Q(s, a) denotes the current-value network output, and Q(s′, a′) denotes the target-value network output;
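For reference, the Huber Loss named above has the standard piecewise form, quadratic inside a threshold δ and linear outside it; the sketch below applies it to a scalar temporal-difference error, with δ = 1 as an assumed threshold value:

```python
def huber(td_error, delta=1.0):
    """Huber loss on a TD error: 0.5*e^2 for |e| <= delta, linear beyond.
    The linear tail damps the gradient of outlier TD errors, which is why
    it is preferred over plain squared error for Q-network training."""
    e = abs(td_error)
    if e <= delta:
        return 0.5 * e * e
    return delta * (e - 0.5 * delta)
```

The loss is symmetric in the sign of the error and grows only linearly for large errors.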
In the above technical scheme, step 7.1: a Deep Q-Learning area coverage algorithm for the free area is designed based on the results of steps S5 and S6, and the behavior selection policy of the agent is determined. According to whether the selectable behavior space A′_i defined in step S5 is empty, the behavior selection of agent i is divided into two cases:
Case 1: A′_i is not empty. During training, so that all states can be visited, behaviors are selected by the ε-greedy method, determined by the following formula:
where rand(1) denotes a random number drawn from (0, 1), Q_i(s_i, a_i) is the output of the target-value network of agent i, the function f_sample denotes random sampling from the selectable behavior space A′_i, and ε is an exploration variable defined as follows:
where ε_start and ε_end denote the initial and final values of ε respectively, σ_ESP is the decay factor, episode_num is the training round (episode) counter, and ε is held constant at 0 during testing after training is complete;
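The decay of ε from ε_start to ε_end is commonly realized as an exponential schedule driven by the episode counter; the fragment below is a sketch under that assumption (all numeric values are made up, and sigma_esp plays the role of the attenuation factor σ_ESP):

```python
import math

def epsilon(episode_num, eps_start=0.9, eps_end=0.05, sigma_esp=200.0):
    """Exponentially anneal the exploration rate from eps_start toward
    eps_end as training rounds accumulate; larger sigma_esp decays slower."""
    return eps_end + (eps_start - eps_end) * math.exp(-episode_num / sigma_esp)
```

Early rounds explore almost uniformly (ε near ε_start); late rounds are nearly greedy (ε near ε_end), and at test time ε is simply forced to 0 as the text states.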
Case 2: A′_i is empty. Agent i cannot reach an uncovered area no matter which behavior it selects; in this case the uncovered γ point closest to the agent is chosen and the shortest path to it is taken. Behavior selection is defined as follows:
step 7.2, giving a Deep Q-Learning-based region coverage algorithm under a free region, enabling an intelligent agent to continuously interact with the environment through a behavior selection strategy and generate experience information, and training a Deep Q-Learning network by using the experience information;
the Deep Q-Learning based region coverage algorithm under the free region comprises the following steps:
Step 7.2.1: initialize the Deep Q-Learning parameters λ, σ_ESP, r_ref, ε_start, ε_end, the control parameters of the agents, the sensor parameters r_s and d, the information-map parameters m and n, the experience-pool capacity C_max, the batch parameter batch_size, and the network parameter update period N_TU;
Step 7.2.2: initialize the information map for all agents, initialize the current-value and target-value network models, and update the parameters of the target-value network model with the following steps:
Step 7.2.2.1: for each round episode = 1 → N_T, perform:
Step 7.2.2.1.1: initialize the position and velocity information of each agent;
Step 7.2.2.1.2: initialize the state s, behavior a and γ point of each agent;
Step 7.2.2.1.3: while the current round has not completed the coverage, execute:
Step 7.2.2.1.3.1: traverse all agents i = 1 → n:
Step 7.2.2.1.3.2: the agent updates u_i, v_i and p_i according to the motion control model;
Step 7.2.2.1.3.3: update the information map; calculate the obtained return r_i; construct the state s′_i and update the state s_i := s′_i;
Step 7.2.2.1.3.4: store the sample (s_i, a_i, s′_i, r_i) in the experience pool D;
Step 7.2.2.1.3.5: determine behavior a_i according to the behavior selection policy and convert a_i into the corresponding γ point;
Step 7.2.2.1.3.6: if the number of samples in the experience pool is greater than batch_size, randomly select batch_size samples (s_i, a_i, s′_i, r_i) from D and train the current-value network with them; otherwise go to step 7.2.2.1.3.1;
Step 7.2.2.2: if mod(episode, N_TU) = 0, copy the parameters of the current-value network to the target-value network;
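The experience pool D with capacity C_max and uniform mini-batch sampling used in the steps above can be sketched as a fixed-size buffer; the class name and all numeric values are illustrative only:

```python
import random
from collections import deque

class ReplayBuffer:
    """Experience pool D: fixed capacity, oldest samples evicted first."""
    def __init__(self, capacity):
        self.data = deque(maxlen=capacity)

    def store(self, s, a, s_next, r):
        # One transition (s_i, a_i, s'_i, r_i) per call.
        self.data.append((s, a, s_next, r))

    def sample(self, batch_size):
        # Uniform random mini-batch; the caller checks len(D) > batch_size.
        return random.sample(list(self.data), batch_size)

buf = ReplayBuffer(capacity=100)
for t in range(150):            # 150 inserts overflow the 100-slot pool
    buf.store(t, 0, t + 1, 0.0)
```

After overflowing, the buffer holds only the 100 most recent transitions, so training batches are drawn from recent experience while stale transitions age out.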
in the above technical solution, the step 8 includes:
Step 8.1: adjust the γ point selected by the Deep Q-Learning network in step 7 as required to obtain the Deep Q-Learning area coverage algorithm under the area with obstacles, and determine the γ-point position adjustment strategy under the obstacle area. If a γ point is covered by an obstacle, coverage of that point need not be considered, and m_i(γ_{x,y}) = 1 is recorded. If an obstacle approaches a γ point and the distance between them is smaller than the obstacle-avoidance distance d_o of the agent, the position of the γ point must be adjusted; the grid cell containing the γ point is taken as the optimization area of a new guide point γ_o, denoted M_obs, and the point γ_o is determined by the following equation:
where the set in the equation is the set of obstacles detected by agent i. If no feasible point exists, let m_i(γ) = 1; otherwise a point of maximum coverage is selected according to the following equation:
where D_1 is the obstacle region within the optimization area M_obs, D_2 is the unobstructed region within M_obs, γ′_{x,y} is a candidate coordinate that can replace the γ point, and the solution of the equation is the optimal point for replacing the γ point;
Area coverage in the obstacle area is similar to that in the free area: when the distance between an obstacle and a γ point is too small, the agent achieves maximum coverage of the area by adjusting the γ point position. Therefore, once the Deep Q-Learning model for the free area has been trained, no separate model needs to be trained for the obstacle area;
step 8.2, a Deep Q-Learning based region coverage algorithm under the obstructed region comprises the following steps:
Step 8.2.1: initialize the control parameters c_α, c_β, etc. of the agents, the sensor parameters r_s, d, etc., and the information-map parameters m, n, etc.;
Step 8.2.2: initialize the obstacle area and initialize the information map for all agents;
Step 8.2.3: initialize the position and velocity information of each agent;
Step 8.2.4: initialize the state s, behavior a and γ point of each agent;
Step 8.2.5: load the network model trained in the free area;
Step 8.2.6: while the current round has not completed the coverage, execute:
Step 8.2.6.1: traverse all agents i = 1 → n:
Step 8.2.6.2: the agent updates u_i, v_i and p_i according to the motion control model;
Step 8.2.6.3: update the information map; calculate the obtained return r_i; construct the state s′_i and update the state s_i := s′_i;
Step 8.2.6.4: determine behavior a_i according to the behavior selection policy and convert it into the corresponding γ point;
Step 8.2.6.5: calculate the minimum distance d_γo between the obstacles and the γ point;
Step 8.2.6.6: if d_γo < d_o, adjust the γ point position with the following steps; otherwise go to step 8.2.6.1;
Step 8.2.6.6.1: construct M_obs and calculate the new guide point γ_o;
Step 8.2.6.6.2: if a feasible replacement point exists, take the new point as the guide point; otherwise let m_i(γ) = 1.
Because the invention adopts the above technical scheme, it has the following beneficial effects:
the invention trains and learns the cluster area coverage control algorithm by means of Deep Q-Learning technology, realizes the cluster area coverage under the free area and the area with the obstacle, effectively improves the cluster area coverage efficiency, and can ensure the stability of the algorithm under the weak communication environment.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a schematic diagram of an observation state construction process;
Detailed Description
The technical solution of the present invention will be described in further detail with reference to the accompanying drawings, but the scope of the present invention is not limited to the following description.
As shown in fig. 1, a cluster area coverage method based on Deep Q-Learning comprises the following steps:
Step S1, a dynamics model of the cluster system is established. A cluster V contains n agents, V = {1, 2, ..., n}; the i-th agent in the cluster is defined as agent i, and its second-order dynamics model is defined as follows:
where p_i is the position of agent i, v_i is the velocity of agent i, u_i is the acceleration input of agent i, n is the total number of agents in the cluster, and ṗ_i and v̇_i denote the derivatives of p_i and v_i with respect to time;
step S2, determining a neighbor set of the agents in the cluster, wherein when the distance between the two agents in the cluster is smaller than the communication distance, the two agents are considered to establish communication connection and share the position and the speed, and the neighbor set of the agent i is described as follows:
N_i = {j ∈ V : ||p_j − p_i|| ≤ r_α, j ≠ i}
where V represents the set of all agents, r_α denotes the communication distance between agents, ||·|| is the Euclidean norm, p_i is the position of agent i, and p_j is the position of agent j;
Step S3, a motion control model of the cluster system is established, where α-agent denotes an agent, β-agent denotes an obstacle detected by an agent, and γ-agent denotes the destination of the agent's motion. Control inputs u_i^α, u_i^β and u_i^γ are generated from the α-agent, β-agent and γ-agent respectively, and the total motion control input of agent i is calculated as u_i = u_i^α + u_i^β + u_i^γ, where:
u_i^α ensures that agents in the cluster do not collide with each other during motion;
u_i^β is the obstacle-avoidance control input when an agent moves in a space containing obstacles;
u_i^γ determines the direction of motion of the agent;
s4, constructing an information map and encoding the information map;
step S5, defining a state space, a behavior space and a return function required by reinforcement learning according to the information map;
s6, designing a network model required by a Deep Q-Learning algorithm;
step S7, designing a Deep Q-Learning region coverage algorithm under the free region based on the results of the steps S5 and S6;
and S8, adjusting the gamma point obtained in the step S7 as required to obtain a Deep Q-Learning region coverage algorithm under the obstructed region.
In the above technical solution, in step 3:
u_i^α ensures that the agents in the cluster do not collide with each other during motion, and is defined as follows:
where c_α is a constant greater than zero; to ensure that the norm is differentiable everywhere, the σ-norm is defined as:
Differentiating the σ-norm gives:
n_ij is defined as follows:
φ_α(z) is a potential energy function defined as follows:
φ_α(z) = ρ_h(z/r_α) φ(z − d_α)
where r_α = ||r_a||_σ, with r_a the communication distance between agents, and d_α = ||d||_σ, with d the ideal spacing between agents;
φ(z) is defined as follows:
where the following constraints are satisfied between a, b and c:
ρ_h(z) is defined as follows:
ρ_h(z) is used in φ_α(z) to ensure the smoothness of the potential energy function; the potential energy is then obtained as an integral:
a_ij(p) is an element of the inter-agent adjacency matrix, defined as follows:
When an agent moves in a space containing obstacles, its obstacle-avoidance control input u_i^β is defined as follows:
where c_β is a constant greater than zero; the set above is the set of obstacles detected by agent i; p_{i,k} and v_{i,k} denote the position and velocity of obstacle k detected by agent i; and the potential energy function φ_β(z) is defined as follows:
where d_β = ||d_o||_σ, with d_o the ideal obstacle-avoidance distance of the agent;
u_i^γ determines the direction of motion of the agent, and is given as follows:
where c_1^γ and c_2^γ are proportional and differential control gains greater than zero, and p_γ is the position of the guide point γ;
in the above technical solution, the step 4 includes:
The area to be traversed is a rectangular region of size M × L. The area is quantized into a γ-information map of m × l cells, the center of each cell corresponding to one guide point γ; complete search of the area is thereby converted into complete traversal of the γ points in the information map. All γ points form the γ-information map set of agent i:
m_i(γ) = {γ_{x,y}}, x = 1, 2, ..., m, y = 1, 2, ..., l
where m and l are obtained by:
where r_s denotes the sensing radius of agent i. If agent i has traversed the position of guide point γ, then m_i(γ) = 1 is recorded; otherwise m_i(γ) = 0;
And the agent i completes fusion updating of the information map according to the information map set of the agent i and the information map set of the neighbor agent, and an updating formula is defined as follows:
where m_i(γ_{x,y}) denotes the γ-information map of agent i, N_i is the set of neighbor agents of agent i, and m_s(γ_{x,y}) are the γ-information maps of the neighbor agents of agent i;
The γ-information map of the agent is encoded as follows: the information map is vectorized by columns, and binary values are taken out 8 at a time and hexadecimal-encoded; if fewer than 8 values remain at the end, the missing bits are padded with 0. After encoding, each group of 8 binary values corresponds to one hexadecimal-coded byte; when other agents receive the codes, they decode them by inverting the encoding process and restore the original information map;
in the above technical solution, the step 5 includes:
as shown in fig. 2, a state space required for reinforcement learning is defined, and for agent i, the state construction method is as follows: firstly, fusing the information maps of the agent i and the neighbor agents according to the update formula of the gamma-information map in the step S4; secondly, assigning a weight value 3 to the gamma point position of the agent i in the information map, and assigning a weight value 2 to the gamma point position of all neighbor agents; finally, linearly stretching the fused information map into a gray map with gray values of 0 to 255, namely, 0 in the information map corresponds to gray value 0, and 3 in the information map corresponds to gray value 255;
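The three state-construction steps above (map fusion, weights 3 and 2, linear stretch to gray values 0 to 255) can be sketched as follows; representing positions as grid indices is an assumption:

```python
def build_state(fused_map, own_pos, neighbor_pos):
    """State construction sketch: assign weight 3 to the agent's own
    gamma point, weight 2 to the neighbors' gamma points, then linearly
    stretch the values 0..3 to gray levels 0..255."""
    m = [row[:] for row in fused_map]
    for (x, y) in neighbor_pos:
        m[x][y] = 2                 # neighbor agents' positions
    x, y = own_pos
    m[x][y] = 3                     # agent i's own position
    return [[round(v * 255 / 3) for v in row] for row in m]
```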
A behavior space required for reinforcement learning is defined, in which a behavior of the agent is the selection of a target γ point; the agent's current position in the γ-map and the 8 selectable γ points around it are numbered 1 to 9, and the behavior space of agent i is defined as follows:
A i ={1,2,3,4,5,6,7,8,9}
If agent i is located at the edge of the information map, its behavior space is a subset of A_i. To accelerate training, only uncovered γ points are taken as the agent's selectable behaviors during training, generating a selectable behavior space A'_i defined as follows:
A'_i = {γ_{x,y} ∈ A_i | m_i(γ_{x,y}) = 0}
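A minimal sketch of building A'_i, assuming the behaviors 1 to 9 are numbered row by row over the 3×3 neighborhood (the exact numbering is not specified in the text):

```python
def selectable_behaviors(info_map, pos):
    """A'_i: of the 9 behaviors (the agent's own cell and its 8
    neighbors, numbered 1..9 row by row -- an assumed ordering), keep
    only those whose gamma point lies inside the map and is still
    uncovered (m_i = 0). Returns {behavior: gamma-point}."""
    m, l = len(info_map), len(info_map[0])
    x0, y0 = pos
    actions = {}
    for a, (dx, dy) in enumerate(
            [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)], start=1):
        x, y = x0 + dx, y0 + dy
        if 0 <= x < m and 0 <= y < l and info_map[x][y] == 0:
            actions[a] = (x, y)
    return actions
```

At a map edge the result is automatically a subset of A_i, matching the text.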
According to the selected behavior, the agent takes the corresponding γ point as its target point, and the control quantity is transmitted to the agent through the motion control model of step S3 so that the agent moves to the target point. In practical application, when |p_i − γ_{x,y}| < ε_d is satisfied, the agent is deemed to have reached the γ point, where ε_d is the allowable distance error;
defining a return function required by reinforcement learning:
where γ'_{x,y} is the next γ point selected by agent i, T is the time consumed to traverse the γ-information map or complete the region coverage process, and R(T) is defined as follows:
wherein both constants are positive, r_ref is the maximum return, and T_min is the theoretical minimum coverage time of the area, which is defined as follows:
where M×L is the size of the target traversal region, m×l is the size of the γ-information map, and v_max is the maximum movement speed of the agent;
in the above technical solution, the step 6 includes:
A network model required by the Deep Q-Learning algorithm is designed. To avoid losing feature information during convolution, the convolution kernel size is set to 3 or 1, the stride of every convolutional layer is set to 1, and the padding parameter is set so that the output feature of each convolutional layer has the same size as the initial image. To avoid losing image features during pooling, the network contains no pooling layer. Following these principles, the layers of the Q network are designed in order as: input dimension 8×8×1; convolutional layer 1, output 8×8×32, kernel 3×3; convolutional layer 2, output 8×8×64, kernel 3×3; convolutional layer 3, output 8×8×128, kernel 3×3; convolutional layer 4, output 8×8×128, kernel 1×1; fully connected layer, 64×1; output layer, 9×1;
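The claim that padding keeps every feature map at the input size follows from the standard convolution output-size formula; a quick check, with `conv_out` as an illustrative helper:

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Standard convolution output-size formula:
    out = (in + 2*padding - kernel) // stride + 1."""
    return (size + 2 * padding - kernel) // stride + 1

# With stride 1, a 3x3 kernel needs padding 1 and a 1x1 kernel padding 0
# for the 8x8 feature size to be preserved through every layer.
assert conv_out(8, 3, stride=1, padding=1) == 8
assert conv_out(8, 1, stride=1, padding=0) == 8
assert conv_out(8, 3, stride=1, padding=0) == 6  # without padding the map shrinks
```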
the activation function selected in the training process is a ReLU function, the Loss function is Huber Loss, and the definition is as follows:
wherein s is the input state of the Q network, a is the behavior selected by the agent, Q (s, a) represents the current value network output, and Q (s ', a') represents the target value network output;
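A scalar sketch of the Huber loss applied to the TD error y − Q(s, a); the threshold δ = 1.0 is an assumed value, not one given in the text:

```python
def huber(td_error, delta=1.0):
    """Huber loss on a TD error: quadratic within |e| <= delta and
    linear outside it, so large TD errors do not produce exploding
    gradients. delta = 1.0 is an assumed threshold."""
    e = abs(td_error)
    if e <= delta:
        return 0.5 * e * e
    return delta * (e - 0.5 * delta)
```

In training, the TD error would be r + λ·max_a' Q(s', a') − Q(s, a), with Q(s', a') from the target value network.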
in the above technical solution, the step 7 includes:
Step 7.1: design the Deep Q-Learning region coverage algorithm under the free region based on the results of steps S5 and S6, and determine the behavior selection strategy of the agent. According to whether the selectable behavior space A'_i defined in step S5 is empty, the behavior selection of agent i can be divided into two cases:
Case one: when A'_i is not empty, in order to enable all states to be trained, the ε-greedy method is used to select behaviors, determined by the following formula:
where rand(1) is a random number drawn from (0, 1), Q_i(s_i, a_i) is the output of the target value network of agent i, the function f_sample denotes random sampling from the selectable behavior space A'_i, and ε is an exploration variable defined as follows:
where ε_start and ε_end represent the initial and final values of ε respectively, σ_ESP is a decay factor, and episode_num is the episode counter during training;
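A sketch of the ε schedule and the ε-greedy rule; the exponential decay law and all numeric defaults are assumptions, since the defining formula is not legible in the source:

```python
import math
import random

def epsilon(episode_num, eps_start=0.9, eps_end=0.05, sigma_esp=200.0):
    """Assumed exponential decay from eps_start toward eps_end with
    decay factor sigma_esp."""
    return eps_end + (eps_start - eps_end) * math.exp(-episode_num / sigma_esp)

def select_behavior(q_values, selectable, eps):
    """epsilon-greedy over the selectable behavior space A'_i: explore
    with probability eps (the f_sample draw), otherwise pick the
    selectable behavior with the highest Q value."""
    if random.random() < eps:
        return random.choice(list(selectable))          # f_sample
    return max(selectable, key=lambda a: q_values[a - 1])
```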
Case two: when A'_i is empty, agent i cannot reach an uncovered area no matter which behavior it selects. In this case the uncovered γ point closest to the agent is selected, along with a shortest path to it; behavior selection is defined as follows:
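The fallback in case two (nearest uncovered γ point) can be sketched as a brute-force search; straight-line Euclidean distance in grid units stands in for the shortest path here:

```python
def nearest_uncovered(info_map, pos, cell=1.0):
    """When A'_i is empty: pick the uncovered gamma point closest to
    the agent. Euclidean distance over grid indices (cell size
    assumed) approximates the shortest path; returns None if the
    whole map is covered."""
    best, best_d = None, float('inf')
    for x, row in enumerate(info_map):
        for y, v in enumerate(row):
            if v == 0:
                d = ((x - pos[0]) ** 2 + (y - pos[1]) ** 2) ** 0.5 * cell
                if d < best_d:
                    best, best_d = (x, y), d
    return best
```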
Step 7.2: a Deep Q-Learning based region coverage algorithm under the free region is given, in which the behavior a_i is determined according to the behavior selection strategy and converted into the corresponding γ point, and the parameters of the current value network are periodically copied to the target value network;
the Deep Q-Learning based region coverage algorithm under the free region comprises the following steps:
Step 7.2.1: initialize the Deep Q-Learning parameters λ, σ_ESP, r_ref, ε_start, ε_end, the control parameters of the agent, the sensor parameters r_s and d, the information map parameters m and n, the experience pool capacity C_max, and the batch parameter batch_size;
step 7.2.2, initializing an information map for all agents, initializing a current value and a target value network model, and updating parameters of the target value network model by adopting the following steps:
Step 7.2.2.1: traverse all episodes, episode = 1 → N_T, and perform:
step 7.2.2.1.1, initializing the position and speed information of each agent;
step 7.2.2.1.2, initializing the state s and behavior a of each agent;
Step 7.2.2.1.3: if the current round number is less than or equal to the total round number N_T, perform:
step 7.2.2.1.3.1, traversing all agents i=1→n:
Step 7.2.2.1.3.2: calculate u_i, v_i and p_i according to the motion control model of the agent;
Step 7.2.2.1.3.3: update the information map; calculate the obtained return r_i; construct state s'_i;
Step 7.2.2.1.3.4: store the sample data (s_i, a_i, s'_i, r_i) into the experience pool D;
Step 7.2.2.1.3.5: determine behavior a_i according to the behavior selection policy and convert a_i into the corresponding γ point; the agent moves to the γ point according to u_i, v_i and p_i, and the state is updated, s_i := s'_i, after the movement ends;
Step 7.2.2.1.3.6: if the number of samples in the experience pool is greater than batch_size, randomly select batch_size samples (s_i, a_i, s'_i, r_i) from D and train the current value network with them; otherwise return to step 7.2.2.1.3.2;
Step 7.2.2.2: if mod(episode, N_TU) = 0, copy the parameters of the current value network to the target value network, where N_TU is the update period;
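The experience-pool and target-network bookkeeping in the steps above amounts to standard DQN machinery; a sketch with illustrative names, where a plain dict stands in for network weights:

```python
import random
from collections import deque

class ReplayAndTarget:
    """Sketch of the bookkeeping above: a bounded experience pool D,
    minibatch sampling once the pool exceeds batch_size, and a periodic
    target-network sync every N_TU episodes. Class and method names are
    illustrative assumptions."""
    def __init__(self, c_max, batch_size, n_tu):
        self.pool = deque(maxlen=c_max)      # experience pool D, capacity C_max
        self.batch_size = batch_size
        self.n_tu = n_tu                     # update period N_TU

    def store(self, s, a, s_next, r):
        self.pool.append((s, a, s_next, r))  # sample (s_i, a_i, s'_i, r_i)

    def sample(self):
        if len(self.pool) <= self.batch_size:
            return None                      # keep collecting experience
        return random.sample(self.pool, self.batch_size)

    def maybe_sync(self, episode, current_params, target_params):
        if episode % self.n_tu == 0:         # mod(episode, N_TU) == 0
            target_params.update(current_params)
```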
in the above technical solution, the step 8 includes:
The γ point obtained in step S7 is adjusted as required to obtain the Deep Q-Learning region coverage algorithm under the obstacle region, and the γ point position adjustment method is determined as follows. If a γ point is covered by an obstacle, coverage of that point need not be considered, and m_i(γ_{x,y}) = 1 is recorded. If an obstacle approaches a γ point and their distance is smaller than the agent's obstacle avoidance distance d_o, the position of the γ point must be adjusted; the grid cell containing the γ point is taken as the search region for a new guide point γ_o and is denoted M_obs, and the point γ_o is determined by the following equation:
where the set of obstacles detected by agent i is used; if no feasible replacement point exists, let m_i(γ) = 1; otherwise a point of maximum coverage is selected according to the following equation:
where D_1 is the obstacle region within the optimization area M_obs, D_2 is the unobstructed region within M_obs, γ'_{x,y} is a candidate coordinate that can replace the γ point, and the selected point is the optimal replacement for the γ point;
Region coverage under the obstacle region is similar to that under the free region: when the distance between an obstacle and a γ point is too small, the agent can still achieve maximum coverage of the region by adjusting the position of the γ point. Therefore, once the Deep Q-Learning model has been trained under the free region, the model does not need to be retrained for the obstacle region;
based on the steps S1 to S8, a Deep Q-Learning based area coverage algorithm under the obstacle area is given as shown in table-2:
TABLE-2 Deep Q-Learning based region coverage algorithm under obstructed regions
The invention realizes the training and Learning of the cluster region coverage control algorithm by means of Deep Q-Learning technology, realizes the cluster region coverage under the free region and the obstacle region, effectively improves the cluster region coverage efficiency, and can ensure the stability of the algorithm under the weak communication environment.
The foregoing is a preferred embodiment of the invention. It should be understood that the invention is not limited to the form disclosed herein and is not to be construed as excluding other embodiments; it is capable of use in various other combinations, modifications, and environments, and changes within the scope of the inventive concept may be made in light of the above teachings or the skill or knowledge of the relevant art. Modifications and variations that do not depart from the spirit and scope of the invention are intended to fall within the scope of the appended claims.
Claims (7)
1. The cluster area coverage method based on Deep Q-Learning is characterized by comprising the following steps of:
step S1, a dynamics model of the cluster system is established: the cluster V contains n agents, V = {1, 2, ..., n}, the i-th agent in the cluster is defined as agent i, and its second-order dynamics model is defined as follows:
wherein p_i is the position of agent i, v_i is the velocity of agent i, u_i is the acceleration of agent i, n is the total number of agents in the cluster, and ṗ_i and v̇_i denote the derivatives of p_i and v_i with respect to time;
step S2, determining a neighbor set of the agents in the cluster, wherein when the distance between the two agents in the cluster is smaller than the communication distance, the two agents are considered to establish communication connection and share the position and the speed, and the neighbor set of the agent i is described as follows:
N_i = {j ∈ V : ||p_j − p_i|| ≤ r_α, j ≠ i}
wherein V represents the set of all agents, r_α denotes the communication distance between agents, ||·|| is the Euclidean norm, p_i is the position of agent i, and p_j is the position of agent j;
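The neighbor-set definition can be sketched directly; the 2-D coordinate representation and the function name are illustrative assumptions:

```python
def neighbor_set(positions, i, r_alpha):
    """N_i = {j in V : ||p_j - p_i|| <= r_alpha, j != i}, with agent
    positions given as 2-D coordinates and r_alpha the communication
    distance."""
    px, py = positions[i]
    return {j for j, (qx, qy) in enumerate(positions)
            if j != i and ((qx - px) ** 2 + (qy - py) ** 2) ** 0.5 <= r_alpha}
```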
step S3, a motion control model of the cluster system is established, wherein α-agent represents an agent, β-agent represents an obstacle detected by an agent, and γ-agent represents the destination of the agent's motion; control terms are generated from the α-agent, β-agent and γ-agent respectively, and the total motion control input of agent i is calculated as follows:
the first term ensures that agents in the cluster do not collide with each other during movement;
the second term is the obstacle-avoidance control quantity when the agent moves in a space containing obstacles;
the third term determines the movement direction of the agent;
step S4, the area to be traversed is quantized into an m×l γ-information map in which each quantized cell center corresponds to one guide point γ, converting the complete search of the area into a complete traversal of the γ points in the information map; all γ points form the γ-information map set of agent i, and agent i completes the fusion update of the information map from its own information map set and those of its neighbor agents to obtain its γ-information map, which it then encodes;
step S5, defining a state space, a behavior space and a return function required by reinforcement learning according to the γ-information map;
step S6, designing a network model required by the Deep Q-Learning algorithm;
step S7, designing a Deep Q-Learning region coverage algorithm under the free region based on the results of the steps S5 and S6, determining a behavior selection strategy of the intelligent agent, continuously interacting with the environment through the behavior selection strategy and generating experience information, and training a Deep Q-Learning network by using the experience information;
and step S8, designing a γ point position adjustment strategy under the obstacle area, and adjusting the γ point selected by the Deep Q-Learning network in step S7 as required to obtain a Deep Q-Learning area coverage algorithm under the obstacle area.
2. The Deep Q-Learning based cluster area coverage method according to claim 1, wherein: in step 3:
the first control term ensures that agents in the cluster do not collide with one another during movement, and is defined as follows:
wherein c_a is a constant greater than zero; to ensure the norm is differentiable everywhere, the σ-norm is defined:
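The σ-norm formula itself is an image in the source; the definition below is the standard one from the Olfati-Saber flocking literature that this notation follows, with ε = 0.1 as an assumed parameter value:

```python
import math

def sigma_norm(z, eps=0.1):
    """Sigma-norm used in Olfati-Saber-style flocking:
    ||z||_sigma = (sqrt(1 + eps * ||z||^2) - 1) / eps.
    Unlike the Euclidean norm it is differentiable everywhere,
    including at z = 0; eps = 0.1 is an assumed parameter."""
    n2 = sum(c * c for c in z)
    return (math.sqrt(1.0 + eps * n2) - 1.0) / eps
```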
differentiating the sigma norm to obtain:
n_ij is defined as follows:
φ_α(z) is a potential energy function defined as follows:
φ_α(z) = ρ_h(z/r_α) φ(z − d_α)
wherein r_α = ||r_a||_σ, with r_a representing the communication distance between agents, and d_α = ||d||_σ, with d representing the ideal spacing between agents;
φ(z) is defined as follows:
wherein the following constraints are satisfied between a, b, c:
ρ_h(z) is defined as follows:
the use of ρ_h(z) in φ_α(z) ensures the smoothness of the potential energy function; the potential energy is obtained by integration:
a_ij(p) is an element of the inter-agent adjacency matrix, defined as follows:
when the agent moves in a space containing obstacles, the obstacle-avoidance control quantity of agent i is defined as follows:
wherein c_β is a constant greater than zero, the set denotes the obstacles detected by agent i, p_{i,k} and v_{i,k} represent the position and velocity of obstacle k detected by agent i, and the potential energy function φ_β(z) is defined as follows:
wherein d_β = ||d_o||_σ, and d_o denotes the ideal obstacle-avoidance spacing of the agent;
the movement direction of the agent is determined as follows:
wherein the two gains are the proportional and differential control parameters, both greater than zero, and p_γ denotes the position of the guide point γ.
3. The Deep Q-Learning based cluster area coverage method according to claim 1, wherein: the step 4 comprises the following steps:
The area to be traversed is a rectangular M×L area, which is quantized into an m×l γ-information map; the center of each quantized cell corresponds to one guide point γ. The complete search of the area is thereby converted into a complete traversal of the γ points in the information map, and all γ points form the γ-information map set of agent i:
m_i(γ) = {γ_{x,y}}, x = 1, 2, ..., m, y = 1, 2, ..., l
wherein m and l are obtained by:
where r_s represents the sensing radius of agent i; if agent i has traversed the position of guide point γ, m_i(γ) = 1 is recorded, otherwise m_i(γ) = 0;
Agent i completes the fusion update of the information map from its own information map set and those of its neighbor agents; the update formula is defined as follows:
where m_i(γ_{x,y}) denotes the γ-information map of agent i, N_i is the set of neighbor agents of agent i, and m_s(γ_{x,y}) denotes the γ-information maps of the neighbor agents of agent i;
The γ-information map of the agent is encoded as follows: the information map is vectorized by columns, 8 binary values are taken out consecutively at a time and encoded in hexadecimal, and if fewer than 8 values remain at the end, the missing bits are padded with 0. After encoding, each group of 8 binary values corresponds to one hexadecimal number; when other agents receive the hexadecimal numbers, they decode them by inverting the encoding process and restore the original information map.
4. The Deep Q-Learning based cluster area coverage method according to claim 1, wherein: the step 5 comprises the following steps:
defining a state space required by reinforcement learning, and for the agent i, the state construction method comprises the following steps: firstly, fusing the information maps of the agent i and the neighbor agents according to the update formula of the gamma-information map in the step S4; secondly, assigning a weight value 3 to the gamma point position of the agent i in the information map, and assigning a weight value 2 to the gamma point position of all neighbor agents; finally, linearly stretching the fused information map into a gray map with gray values of 0 to 255, namely, 0 in the information map corresponds to gray value 0, and 3 in the information map corresponds to gray value 255;
A behavior space required for reinforcement learning is defined, in which a behavior of the agent is the selection of a target γ point; the agent's current position in the γ-map and the 8 selectable γ points around it are numbered 1 to 9, and the behavior space of agent i is defined as follows:
A i ={1,2,3,4,5,6,7,8,9}
If agent i is located at the edge of the information map, its behavior space is a subset of A_i. To accelerate training, only uncovered γ points are taken as the agent's selectable behaviors during training, generating a selectable behavior space A'_i defined as follows:
A'_i = {γ_{x,y} ∈ A_i | m_i(γ_{x,y}) = 0}
According to the selected behavior, the agent takes the corresponding γ point as its target point, and the control quantity is transmitted to the agent through the motion control model of step S3 so that the agent moves to the target point. In practical application, when |p_i − γ_{x,y}| < ε_d is satisfied, the agent is deemed to have reached the γ point, where ε_d is the allowable distance error;
defining a return function required by reinforcement learning:
where γ'_{x,y} is the next γ point selected by agent i, T is the time consumed to traverse the γ-information map or complete the region coverage process, and R(T) is defined as follows:
wherein both constants are positive, r_ref is the maximum return, and T_min is the theoretical minimum coverage time of the area, which is defined as follows:
where M×L is the size of the target traversal region, m×l is the size of the γ-information map, and v_max is the maximum movement speed of the agent.
5. The Deep Q-Learning based cluster area coverage method according to claim 1, wherein: the step 6 comprises the following steps:
A network model required by the Deep Q-Learning algorithm is designed. To avoid losing feature information during convolution, the convolution kernel size is set to 3 or 1, the stride of every convolutional layer is set to 1, and the padding parameter is set so that the output feature of each convolutional layer has the same size as the initial image. To avoid losing image features during pooling, the network contains no pooling layer. Following these principles, the layers of the Q network are designed in order as: input dimension 8×8×1; convolutional layer 1, output 8×8×32, kernel 3×3; convolutional layer 2, output 8×8×64, kernel 3×3; convolutional layer 3, output 8×8×128, kernel 3×3; convolutional layer 4, output 8×8×128, kernel 1×1; fully connected layer, 64×1; output layer, 9×1;
the activation function selected in the training process is a ReLU function, the Loss function is Huber Loss, and the definition is as follows:
where s is the input state of the Q network, a is the behavior selected by the agent, Q (s, a) represents the current value network output, and Q (s ', a') represents the target value network output.
6. The Deep Q-Learning based cluster area coverage method according to claim 1, wherein:
step 7.1, design the Deep Q-Learning region coverage algorithm under the free region based on the results of steps S5 and S6, and determine the behavior selection strategy of the agent: according to whether the selectable behavior space A'_i defined in step S5 is empty, the behavior selection of agent i can be divided into two cases:
Case one: when A'_i is not empty, in order to enable all states to be trained during training, the ε-greedy method is adopted to select behaviors, and behavior selection is determined by the following formula:
where rand(1) is a random number drawn from (0, 1), Q_i(s_i, a_i) is the output of the target value network of agent i, the function f_sample denotes random sampling from the selectable behavior space A'_i, and ε is an exploration variable defined as follows:
where ε_start and ε_end represent the initial and final values of ε respectively, σ_ESP is a decay factor, episode_num is the episode counter during training, and ε is held constant at 0 during testing after training is completed;
Case two: when A'_i is empty, agent i cannot reach an uncovered area no matter which behavior it selects. In this case the uncovered γ point closest to the agent is selected, along with a shortest path to it; behavior selection is defined as follows:
step 7.2, giving a Deep Q-Learning-based region coverage algorithm under a free region, enabling an intelligent agent to continuously interact with the environment through a behavior selection strategy and generate experience information, and training a Deep Q-Learning network by using the experience information;
the Deep Q-Learning based region coverage algorithm under the free region comprises the following steps:
step 7.2.1, initialize the Deep Q-Learning parameters λ, σ_ESP, r_ref, ε_start, ε_end, the agent control parameter c_a, the sensor parameters r_s and d, the information map parameters m and n, the experience pool capacity C_max, the batch parameter batch_size, and the network parameter update period N_TU;
Step 7.2.2, initializing an information map for all agents, initializing a current value and a target value network model, and updating parameters of the target value network model by adopting the following steps:
step 7.2.2.1, traverse all episodes, episode = 1 → N_T, and perform:
step 7.2.2.1.1, initializing the position and speed information of each agent;
step 7.2.2.1.2, initializing the state s, behavior a and gamma points of each agent;
step 7.2.2.1.3, if the current round does not complete the coverage, executing:
step 7.2.2.1.3.1, traversing all agents i=1→n:
step 7.2.2.1.3.2, the agent updates u_i, v_i and p_i according to the motion control model;
step 7.2.2.1.3.3, update the information map; calculate the obtained return r_i; construct state s'_i and update the state s_i := s'_i;
step 7.2.2.1.3.4, store the sample data (s_i, a_i, s'_i, r_i) into the experience pool D;
step 7.2.2.1.3.5, determine behavior a_i according to the behavior selection policy and convert a_i into the corresponding γ point;
step 7.2.2.1.3.6, if the number of samples in the experience pool is greater than batch_size, randomly select batch_size samples (s_i, a_i, s'_i, r_i) from D and train the current value network with them; otherwise return to step 7.2.2.1.3.1;
step 7.2.2.2, if mod(episode, N_TU) = 0, copy the parameters of the current value network to the target value network.
7. The Deep Q-Learning based cluster area coverage method according to claim 1, wherein: the step 8 includes:
step 8.1, adjust the γ point selected by the Deep Q-Learning network in step S7 as required to obtain the Deep Q-Learning region coverage algorithm under the obstacle region, and determine the γ point position adjustment strategy under the obstacle region: if a γ point is covered by an obstacle, coverage of that point need not be considered, and m_i(γ_{x,y}) = 1 is recorded; if an obstacle approaches a γ point and their distance is smaller than the agent's obstacle avoidance distance d_o, the position of the γ point must be adjusted, the grid cell containing the γ point is taken as the search region for a new guide point γ_o and denoted M_obs, and the point γ_o is determined by the following equation:
where the set of obstacles detected by agent i is used; if no feasible replacement point exists, let m_i(γ) = 1; otherwise a point of maximum coverage is selected according to the following equation:
where D_1 is the obstacle region within the optimization area M_obs, D_2 is the unobstructed region within M_obs, γ'_{x,y} is a candidate coordinate that can replace the γ point, and the selected point is the optimal replacement for the γ point;
Region coverage under the obstacle region is similar to that under the free region: when the distance between an obstacle and a γ point is too small, the agent can still achieve maximum coverage of the region by adjusting the position of the γ point. Therefore, once the Deep Q-Learning model has been trained under the free region, the model does not need to be retrained for the obstacle region;
step 8.2, a Deep Q-Learning based region coverage algorithm under the obstructed region comprises the following steps:
step 8.2.1, initialize the agent control parameters c_a, c_β, etc., the sensor parameters r_s, d, etc., and the information map parameters m, n, etc.;
step 8.2.2, initializing an obstacle area, and initializing an information map for all agents;
step 8.2.3, initializing the position and speed information of each agent;
step 8.2.4, initializing the state s and the behavior a and gamma points of each agent;
step 8.2.5, loading a network model which is trained and completed under a free area;
step 8.2.6, if the current round does not complete the coverage, executing:
step 8.2.6.1, traversing all agents i=1→n:
step 8.2.6.2, the agent updates u_i, v_i and p_i according to the motion control model;
step 8.2.6.3, update the information map; calculate the obtained return r_i; construct state s'_i and update the state s_i := s'_i;
Step 8.2.6.4, determining behavior a according to the behavior selection policy i Converting into a corresponding gamma point;
step 8.2.6.5, calculate the minimum obstacle-avoidance distance d_γo between the obstacle and the γ point;
step 8.2.6.6, if d_γo < d_o, adjust the γ point position using the following steps; otherwise execute step 8.2.6.1;
step 8.2.6.6.1, construct M_obs and calculate γ_o, where γ_o is the new guide point;
step 8.2.6.6.2, if a feasible replacement point exists, obtain the new replacement point; otherwise let m_i(γ) = 1.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210026133.0A CN114326749B (en) | 2022-01-11 | 2022-01-11 | Deep Q-Learning-based cluster area coverage method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114326749A CN114326749A (en) | 2022-04-12 |
CN114326749B true CN114326749B (en) | 2023-10-13 |
Family
ID=81026417
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210026133.0A Active CN114326749B (en) | 2022-01-11 | 2022-01-11 | Deep Q-Learning-based cluster area coverage method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114326749B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2006072477A (en) * | 2004-08-31 | 2006-03-16 | Nippon Telegr & Teleph Corp <Ntt> | Dialogue strategy learning method, program, and device, and storage medium |
CN111880565A (en) * | 2020-07-22 | 2020-11-03 | 电子科技大学 | Q-Learning-based cluster cooperative countermeasure method |
CN111880564A (en) * | 2020-07-22 | 2020-11-03 | 电子科技大学 | Multi-agent area searching method based on collaborative reinforcement learning |
CN113110478A (en) * | 2021-04-27 | 2021-07-13 | 广东工业大学 | Method, system and storage medium for multi-robot motion planning |
CN113156954A (en) * | 2021-04-25 | 2021-07-23 | 电子科技大学 | Multi-agent cluster obstacle avoidance method based on reinforcement learning |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
GB2557674B (en) * | 2016-12-15 | 2021-04-21 | Samsung Electronics Co Ltd | Automated Computer Power Management System, Apparatus and Methods |
- 2022-01-11: application CN202210026133.0A filed; patent CN114326749B granted, status active
Non-Patent Citations (1)
Title |
---|
Robot end-to-end control method based on deep Q-network learning; Zhang Haojie; Su Zhibao; Su Bo; Chinese Journal of Scientific Instrument (No. 10); full text *
Also Published As
Publication number | Publication date |
---|---|
CN114326749A (en) | 2022-04-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Lamini et al. | Genetic algorithm based approach for autonomous mobile robot path planning | |
CN110991972B (en) | Cargo transportation system based on multi-agent reinforcement learning | |
CN110347155B (en) | Intelligent vehicle automatic driving control method and system | |
CN112819253A (en) | Unmanned aerial vehicle obstacle avoidance and path planning device and method | |
Hagras et al. | Learning and adaptation of an intelligent mobile robot navigator operating in unstructured environment based on a novel online Fuzzy–Genetic system | |
CN111260026B (en) | Navigation migration method based on meta reinforcement learning | |
CN111880564A (en) | Multi-agent area searching method based on collaborative reinforcement learning | |
CN113299084B (en) | Regional signal lamp cooperative control method based on multi-view coding migration reinforcement learning | |
CN116382267B (en) | Robot dynamic obstacle avoidance method based on multi-mode pulse neural network | |
CN109540163A (en) | Obstacle-avoidance route planning algorithm combining differential evolution and fuzzy control | |
Janikow | A genetic algorithm method for optimizing the fuzzy component of a fuzzy decision tree | |
Showalter et al. | Neuromodulated multiobjective evolutionary neurocontrollers without speciation | |
Thabet et al. | Sample-efficient deep reinforcement learning with imaginary rollouts for human-robot interaction | |
CN113110052A (en) | Hybrid energy management method based on neural network and reinforcement learning | |
CN114326749B (en) | Deep Q-Learning-based cluster area coverage method | |
CN116080688B (en) | Brain-inspired intelligent driving vision assistance method, device and storage medium | |
Lee et al. | A genetic algorithm based robust learning credit assignment cerebellar model articulation controller | |
Showalter et al. | Lamarckian inheritance in neuromodulated multiobjective evolutionary neurocontrollers | |
CN116300755A (en) | Double-layer optimal scheduling method and device for heat storage-containing heating system based on MPC | |
CN109978133A (en) | Reinforcement learning transfer method based on action modes | |
CN114859719A (en) | Graph neural network-based reinforcement learning cluster flocking control method | |
Showalter et al. | Objective comparison and selection in mono-and multi-objective evolutionary neurocontrollers | |
Barto | An approach to learning control surfaces by connectionist systems | |
Butz et al. | REPRISE: A Retrospective and Prospective Inference Scheme. | |
CN115202339B (en) | DQN-based adaptive planning method for multiple lunar rovers sampling fixed targets | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||