CN114326749B - Deep Q-Learning-based cluster area coverage method - Google Patents


Info

Publication number
CN114326749B
CN114326749B (application CN202210026133.0A)
Authority
CN
China
Prior art keywords
agent
gamma
learning
deep
point
Prior art date
Legal status
Active
Application number
CN202210026133.0A
Other languages
Chinese (zh)
Other versions
CN114326749A (en)
Inventor
袁国慧
王卓然
肖剑
何劲辉
Current Assignee
Yangtze River Delta Research Institute of UESTC Huzhou
Original Assignee
Yangtze River Delta Research Institute of UESTC Huzhou
Priority date
Filing date
Publication date
Application filed by Yangtze River Delta Research Institute of UESTC Huzhou
Priority to CN202210026133.0A
Publication of CN114326749A
Application granted
Publication of CN114326749B
Legal status: Active

Classifications

    • Y: General tagging of new technological developments; general tagging of cross-sectional technologies spanning over several sections of the IPC; technical subjects covered by former USPC cross-reference art collections [XRACs] and digests
    • Y02: Technologies or applications for mitigation or adaptation against climate change
    • Y02D: Climate change mitigation technologies in information and communication technologies [ICT], i.e. information and communication technologies aiming at the reduction of their own energy use
    • Y02D30/00: Reducing energy consumption in communication networks
    • Y02D30/70: Reducing energy consumption in communication networks in wireless communication networks


Abstract

The invention discloses a cluster area coverage method based on Deep Q-Learning, which comprises the following steps: establishing a dynamics model of the cluster system; determining the neighbor sets of the agents in the cluster; establishing a motion control model of the cluster system; constructing an information map and encoding it; defining the state space, behavior space and reward function required for reinforcement learning according to the information map; designing the network model required by the Deep Q-Learning algorithm; designing a Deep Q-Learning area coverage algorithm for the free region; and adjusting the obtained guide points as required to obtain a Deep Q-Learning area coverage algorithm for the obstacle region. By means of the Deep Q-Learning technique, the invention trains and learns a cluster area coverage control algorithm, realizes cluster area coverage in both free and obstacle regions, effectively improves cluster area coverage efficiency, and ensures the stability of the algorithm in weak communication environments.

Description

Deep Q-Learning-based cluster area coverage method
Technical Field
The invention belongs to the fields of multi-agent clusters and Q-Learning, and particularly relates to a cluster area coverage method based on Deep Q-Learning.
Background
The idea of multi-agent clusters derives from observation and study of collective motion in nature: for example, sharks drive fish shoals to the sea surface before preying on them, and goose flocks reduce air resistance by maintaining a specific formation during migration, so multi-agent clustering is a form of bionic research. With the rise of artificial intelligence in recent years, intelligent control of robots, unmanned aerial vehicles, unmanned vehicles and the like has become a popular research field and has made significant progress.
Cluster area coverage has important application and research value, for example in the exploration of unknown regions and the monitoring of target regions. Existing cluster area coverage methods make poor use of historical coverage information, and the resulting repeated coverage greatly reduces algorithmic efficiency. Improving the efficiency of area coverage algorithms, so that the searched area is maximized in the shortest time, is therefore an important research direction in multi-agent cluster search control.
Deep Q-Learning is an algorithm that replaces the Q-value table of traditional reinforcement learning with a deep neural network in order to optimize decisions. During cluster area coverage in a complex environment, multiple agents can use the deep neural network to learn state and behavior characteristics and select strategies for planning guide points. Once the Deep Q-Learning algorithm has been trained, an optimal guide-point planning strategy is obtained, so that the cluster can rapidly cover the target area.
Disclosure of Invention
The invention aims to overcome the defects of the prior art, and provides a cluster area coverage method based on Deep Q-Learning, which can realize the cluster area coverage under a free area and an obstacle area, effectively improve the cluster area coverage efficiency and ensure the stability of an algorithm under a weak communication environment.
The aim of the invention is achieved by the following technical scheme: a cluster area coverage method based on Deep Q-Learning comprises the following steps:
Step S1: establish a dynamics model of the cluster system. A cluster V contains n agents, V = {1, 2, ..., n}, and the i-th agent in the cluster is denoted agent i. The second-order dynamics model is defined as:
dp_i/dt = v_i,  dv_i/dt = u_i
where p_i is the position of agent i, v_i is the velocity of agent i, u_i is the acceleration of agent i, n is the total number of agents in the cluster, and dp_i/dt, dv_i/dt denote the derivatives of p_i and v_i with respect to time;
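As a minimal illustration of this model, the double integrator can be stepped forward in discrete time with Euler integration; the time step dt and the NumPy array layout used here are assumptions for the sketch, not part of the patent:

```python
import numpy as np

def step_dynamics(p, v, u, dt=0.1):
    """One Euler step of the second-order (double-integrator) model.

    p, v, u: (n, 2) arrays holding the positions, velocities and
    accelerations of the n agents; dt is an assumed integration step.
    """
    p_next = p + v * dt   # dp_i/dt = v_i
    v_next = v + u * dt   # dv_i/dt = u_i
    return p_next, v_next
```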
step S2, determining a neighbor set of the agents in the cluster, wherein when the distance between the two agents in the cluster is smaller than the communication distance, the two agents are considered to establish communication connection and share the position and the speed, and the neighbor set of the agent i is described as follows:
N_i = {j ∈ V : ||p_j - p_i|| ≤ r_α, j ≠ i}
where V represents the set of all agents, r_α is the communication distance between agents, ||·|| is the Euclidean norm, p_i is the position of agent i, and p_j is the position of agent j;
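A short sketch of the neighbor-set computation implied by this definition (the array shapes and the function name are illustrative assumptions):

```python
import numpy as np

def neighbor_set(p, i, r_alpha):
    """N_i = {j : ||p_j - p_i|| <= r_alpha, j != i} for positions p of shape (n, 2)."""
    dists = np.linalg.norm(p - p[i], axis=1)   # Euclidean distance to every agent
    return [j for j in range(len(p)) if j != i and dists[j] <= r_alpha]
```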
Step S3: establish a motion control model of the cluster system, where an α-agent denotes an agent, a β-agent denotes an obstacle detected by an agent, and a γ-agent denotes the destination of an agent's motion. Control terms are generated from the α-agent, β-agent and γ-agent respectively, and the total motion control input of agent i is calculated as their sum, where:
the α term ensures that the agents in the cluster do not collide with each other during motion;
the β term is the obstacle-avoidance control term when the agent moves in a space containing obstacles;
the γ term determines the movement direction of the agent;
Step S4: quantize the area to be traversed into a γ-information map of m×l grid cells, where the centre of each cell corresponds to one guide point γ, so that a complete search of the area is converted into a complete traversal of the γ points in the information map; all γ points form the γ-information map set of agent i, and agent i fuses and updates its information map according to its own information map set and those of its neighbor agents, obtains the γ-information map of agent i, and encodes the information map;
Step S5: define the state space, behavior space and reward function required for reinforcement learning according to the γ-information map;
Step S6: design the network model required by the Deep Q-Learning algorithm;
Step S7: design the Deep Q-Learning area coverage algorithm under the free region based on the results of steps S5 and S6, determine the behavior selection strategy of the agents, let the agents continuously interact with the environment through the behavior selection strategy and generate experience information, and train the Deep Q-Learning network with the experience information;
Step S8: design a γ point position adjustment strategy for the obstacle region, and adjust the γ points selected by the Deep Q-Learning network in step S7 as required to obtain the Deep Q-Learning area coverage algorithm under the obstacle region.
In the above technical solution, in step S3:
The α term ensures that the agents in the cluster do not collide with each other during motion and is defined as follows:
where c_α is a constant greater than zero; to ensure that the norm is differentiable everywhere, the σ-norm is defined:
Differentiating the σ-norm gives:
n_ij is defined as follows:
φ_α(z) is a potential energy function defined as follows:
φ_α(z) = ρ_h(z/r_α) φ(z - d_α)
where r_α = ||r_a||_σ, r_a represents the communication distance between agents; d_α = ||d||_σ, d represents the ideal spacing between agents;
φ(z) is defined as follows:
where the following constraints are satisfied among a, b, c:
ρ_h(z) is defined as follows:
ρ_h(z) is used in φ_α(z) to ensure the smoothness of the potential energy function; the potential energy is obtained by an integration process:
a_ij(p) is an element of the inter-agent adjacency matrix, defined as follows:
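Since the σ-norm and bump-function formulas are only given as figures in the patent, the sketch below shows the forms these functions usually take in Olfati-Saber-style flocking; the constants eps and h and the exact functional forms are therefore assumptions, not reproductions of the patent's definitions:

```python
import numpy as np

def sigma_norm(z, eps=0.1):
    """Everywhere-differentiable sigma-norm ||z||_sigma (assumed standard form)."""
    return (np.sqrt(1.0 + eps * np.linalg.norm(z) ** 2) - 1.0) / eps

def rho_h(z, h=0.2):
    """Bump function rho_h(z): 1 on [0, h), smooth decay to 0 on [h, 1], 0 elsewhere."""
    if 0.0 <= z < h:
        return 1.0
    if h <= z <= 1.0:
        return 0.5 * (1.0 + np.cos(np.pi * (z - h) / (1.0 - h)))
    return 0.0
```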
When an agent moves in a space containing obstacles, the obstacle-avoidance control term of agent i is defined as follows:
where c_β is a constant greater than zero, the set appearing in the formula is the set of obstacles detected by agent i, p_i,k and v_i,k denote the position and velocity information of obstacle k detected by agent i, and the potential energy function φ_β(z) is defined as follows:
where d_β = ||d_o||_σ and d_o is the ideal obstacle-avoidance distance of the agent;
The γ term determines the movement direction of the agent, as follows:
where the two gains are proportional and differential control parameters greater than zero, respectively, and p_γ is the position of the guide point γ;
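The γ term is effectively a PD controller toward the current guide point. A hedged sketch, in which the gains c1 and c2 and the form -c1·(p_i - p_γ) - c2·v_i are assumptions consistent with common flocking navigation terms rather than the patent's unreproduced formula:

```python
import numpy as np

def u_gamma(p_i, v_i, p_gamma, c1=1.0, c2=1.0):
    """PD-style navigation term driving agent i toward the guide point gamma."""
    p_i, v_i, p_gamma = map(np.asarray, (p_i, v_i, p_gamma))
    return -c1 * (p_i - p_gamma) - c2 * v_i
```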
In the above technical solution, step S4 comprises:
The area to be traversed is a rectangular region of size M×L. The region is quantized into a γ-information map of m×l grid cells, the centre of each cell corresponding to one guide point γ, so that a complete search of the region is converted into a complete traversal of the γ points in the information map. All γ points form the γ-information map set of agent i:
m_i(γ) = {γ_x,y}, x = 1, 2, ..., m, y = 1, 2, ..., l
where m and l are obtained from:
where r_s denotes the sensing radius of agent i; if agent i has traversed the position of guide point γ, m_i(γ) = 1 is recorded, otherwise m_i(γ) = 0;
Agent i fuses and updates its information map according to its own information map set and those of its neighbor agents; the update formula is defined as follows:
where m_i(γ_x,y) denotes the γ-information map of agent i, N_i is the set of neighbor agents of agent i, and m_s(γ_x,y) denotes the γ-information maps of the neighbor agents of agent i;
The γ-information map of the agent is encoded as follows: the information map is vectorized column by column, 8 binary values are taken out consecutively at a time and encoded as one hexadecimal value, and if fewer than 8 values remain at the end, the missing bits are padded with 0; after encoding, every 8 binary values correspond to 1 hexadecimal value, and when other agents receive the hexadecimal values they decode them by the inverse of the encoding process and restore the original information map;
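A sketch of this column-wise pack-to-hex encoding and its inverse; the helper names and the use of NumPy are illustrative assumptions, the patent only fixes the 8-bits-per-hex-value grouping and the zero padding:

```python
import numpy as np

def encode_map(info_map):
    """Column-major vectorize the binary map, pad to a multiple of 8,
    and pack every 8 bits into one hexadecimal value."""
    bits = info_map.T.flatten().astype(int)            # column-wise vectorization
    pad = (-len(bits)) % 8
    bits = np.concatenate([bits, np.zeros(pad, int)])  # pad missing bits with 0
    packed = [int("".join(map(str, bits[i:i + 8])), 2) for i in range(0, len(bits), 8)]
    return [format(b, "02x") for b in packed]

def decode_map(hex_values, shape):
    """Inverse operation: hexadecimal values back to the original m x l binary map."""
    bits = "".join(format(int(h, 16), "08b") for h in hex_values)
    m, l = shape
    arr = np.array(list(bits[: m * l]), dtype=int)
    return arr.reshape(l, m).T                         # undo the column-major flattening
```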
In the above technical solution, step S5 comprises:
Define the state space required for reinforcement learning. For agent i, the state is constructed as follows: first, the information maps of agent i and its neighbor agents are fused according to the γ-information map update formula of step S4; second, a weight of 3 is assigned to the γ point at the position of agent i in the information map, and a weight of 2 is assigned to the γ points at the positions of all neighbor agents; finally, the fused information map is linearly stretched into a grayscale image with gray values from 0 to 255, i.e. 0 in the information map corresponds to gray value 0 and 3 corresponds to gray value 255;
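A minimal sketch of this state construction, assuming the fused map uses 0/1 for coverage, 2 for neighbor positions and 3 for the agent's own position as described above (the function name and array layout are illustrative):

```python
import numpy as np

def build_state(fused_map, own_cell, neighbor_cells):
    """Build the grayscale observation state for one agent.

    fused_map: (m, l) array with 0 = uncovered, 1 = covered.
    own_cell / neighbor_cells: (x, y) indices of the gamma points occupied
    by the agent itself and by its neighbors.
    """
    state = fused_map.astype(float).copy()
    for (x, y) in neighbor_cells:
        state[x, y] = 2.0          # weight 2 at neighbor positions
    x, y = own_cell
    state[x, y] = 3.0              # weight 3 at the agent's own position
    return state / 3.0 * 255.0     # linear stretch: 0 -> 0, 3 -> 255
```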
Define the behavior space required for reinforcement learning. The behavior of an agent is represented by the selection of a target γ point; the agent's current position in the γ-map and the 8 selectable γ points around it are represented by the numbers 1 to 9, and the behavior space of agent i is defined as follows:
A_i = {1, 2, 3, 4, 5, 6, 7, 8, 9}
If agent i is located at the edge of the information map, its behavior space is a subset of A_i. To accelerate training, only uncovered γ points are taken as selectable behaviors of the agent during training, which yields the selectable behavior space A'_i, defined as follows:
A'_i = {γ_x,y ∈ A_i | m_i(γ_x,y) = 0}
According to the selected behavior, the agent takes the corresponding γ point as its target point, and the control input computed by the motion control model of step S3 drives the agent toward that target point. In practice, the agent is considered to have reached the γ point when ||p_i - γ_x,y|| < ε_d, where ε_d is the allowable distance error;
Define the reward function required for reinforcement learning:
where γ'_x,y is the next γ point selected by agent i, T is the time consumed to traverse the γ-information map, i.e. to complete the area coverage process, and R(T) is defined as follows:
where the constants involved are all positive, r_ref is the maximum reward, and T_min is the theoretical minimum coverage time of the area, defined as follows:
where M×L is the size of the target traversal region, m×l is the size of the γ-information map, and v_max is the maximum movement speed of the agent;
In the above technical solution, step S6 comprises:
Design the network model required by the Deep Q-Learning algorithm. To avoid losing feature information during convolution, the convolution kernel sizes are set to 3 or 1, the stride of all convolution kernels is set to 1, and a padding parameter is set so that the output feature map of every convolution layer has the same size as the initial image; to avoid losing image features during pooling, the network contains no pooling layers. Following these principles, the layers of the Q network are designed in sequence as follows: input of dimension 8×8×1; convolution layer 1 with output dimension 8×8×32 and kernel size 3×3; convolution layer 2 with output dimension 8×8×64 and kernel size 3×3; convolution layer 3 with output dimension 8×8×128 and kernel size 3×3; convolution layer 4 with output dimension 8×8×128 and kernel size 1×1; a fully connected layer of dimension 64×1; and an output layer of dimension 9×1;
the activation function selected in the training process is a ReLU function, the Loss function is Huber Loss, and the definition is as follows:
where s is the input state of the Q network, a is the behavior selected by the agent, Q(s, a) denotes the output of the current value network, and Q(s', a') denotes the output of the target value network;
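A sketch of this Q-network in PyTorch, following the layer sizes listed above; padding of 1 for the 3×3 kernels keeps the 8×8 spatial size, and the flattening of the 8×8×128 feature map into the 64-unit fully connected layer is an assumption, since the patent does not spell out that transition:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Q-network for the 8x8x1 grayscale state with 9 discrete behaviors."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=1, padding=1),    # 8x8x32
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1),   # 8x8x64
            nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, stride=1, padding=1),  # 8x8x128
            nn.ReLU(),
            nn.Conv2d(128, 128, kernel_size=1, stride=1),            # 8x8x128
            nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(8 * 8 * 128, 64),   # fully connected layer, 64 units (assumed flattening)
            nn.ReLU(),
            nn.Linear(64, 9),             # one Q-value per behavior
        )

    def forward(self, x):                 # x: (batch, 1, 8, 8)
        return self.head(self.features(x))

loss_fn = nn.SmoothL1Loss()               # Huber loss, as specified
```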
in the above technical scheme, step 7.1, designing a Deep Q-Learning region coverage algorithm under the free region based on the results of steps S5 and S6, and determining a behavior selection policy of the agent: according to the optional behavior space A 'defined in the step S5' i Whether or not it is an empty set, the behavior selection of agent i can be divided into two cases:
in one of the two ways,in the training process, in order to enable all states to be trained, an epsilon-greedy method is adopted to select behaviors, and the behavior selection is determined by the following formula:
wherein rand (1) represents a random number taken in (0, 1)Number of machines, Q i (s i ,a i ) For the output of the target value network of agent i, function f sample Representing the slave selectable behavior space A' i Epsilon is an exploration variable defined as follows:
epsilon in the formula start and εend Respectively represent the initial value and the final value of epsilon, sigma ESP As an attenuation factor, episode_num is a round parameter episode in training, and epsilon is constantly 0 in the test process after training is completed;
Case 2: A'_i is empty. Agent i cannot reach an uncovered area no matter which behavior it selects; in this case the uncovered γ point closest to the agent is selected, which gives a shortest path, and the behavior selection is defined as follows:
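A sketch of the two-case behavior selection described above; the exponential decay ε = ε_end + (ε_start - ε_end)·exp(-episode_num/σ_ESP) and the helper names are assumptions consistent with the parameters named in the text, since the patent's formulas are given only as figures:

```python
import math
import random
import numpy as np

def epsilon(episode_num, eps_start=1.0, eps_end=0.05, sigma_esp=200.0, training=True):
    """Exploration variable; fixed at 0 during testing (decay form assumed)."""
    if not training:
        return 0.0
    return eps_end + (eps_start - eps_end) * math.exp(-episode_num / sigma_esp)

def select_behavior(q_values, selectable, eps):
    """Case 1 (A'_i non-empty): epsilon-greedy over the selectable behaviors.
    q_values: mapping from behavior id to Q-value."""
    if random.random() < eps:
        return random.choice(selectable)                  # f_sample: random behavior
    return max(selectable, key=lambda a: q_values[a])     # greedy w.r.t. Q output

def nearest_uncovered_gamma(agent_pos, uncovered):
    """Case 2 (A'_i empty): head for the closest uncovered gamma point."""
    dists = [np.linalg.norm(np.asarray(g) - np.asarray(agent_pos)) for g in uncovered]
    return uncovered[int(np.argmin(dists))]
```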
step 7.2, giving a Deep Q-Learning-based region coverage algorithm under a free region, enabling an intelligent agent to continuously interact with the environment through a behavior selection strategy and generate experience information, and training a Deep Q-Learning network by using the experience information;
the Deep Q-Learning based region coverage algorithm under the free region comprises the following steps:
Step 7.2.1: initialize the Deep Q-Learning parameters λ, σ_ESP, r_ref, ε_start and ε_end, the control parameters of the agents, the sensor parameters r_s and d, the information map parameters m and n, the capacity C_max of the experience pool D, the batch parameter batch_size, and the network parameter update period N_TU;
Step 7.2.2: initialize an information map for all agents, initialize the current value and target value network models, and update the parameters of the target value network model with the following steps:
Step 7.2.2.1: for each episode = 1 → N_T, perform:
Step 7.2.2.1.1: initialize the position and velocity information of each agent;
Step 7.2.2.1.2: initialize the state s, behavior a and γ point of each agent;
Step 7.2.2.1.3: while the current episode has not completed the coverage, execute:
Step 7.2.2.1.3.1: traverse all agents i = 1 → n:
Step 7.2.2.1.3.2: the agent updates u_i, v_i and p_i according to the motion control model;
Step 7.2.2.1.3.3: update the information map; calculate the obtained reward r_i; construct the state s'_i and update the state s_i := s'_i;
Step 7.2.2.1.3.4: store the sample (s_i, a_i, s'_i, r_i) into the experience pool D;
Step 7.2.2.1.3.5: determine the behavior a_i according to the behavior selection strategy and convert a_i into the corresponding γ point;
Step 7.2.2.1.3.6: if the number of samples in the experience pool is greater than batch_size, randomly select batch_size samples (s_i, a_i, s'_i, r_i) from D and train the current value network with them; otherwise go to step 7.2.2.1.3.1;
Step 7.2.2.2: if mod(episode, N_TU) = 0, copy the parameters of the current value network to the target value network;
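A hedged sketch of the replay-buffer training step and the periodic target-network copy in the algorithm above, reusing the QNetwork class from the earlier sketch; the discount factor, optimizer, learning rate and terminal-state handling are assumptions not fixed by the patent text:

```python
import random
from collections import deque
import torch

replay = deque(maxlen=100_000)                     # experience pool D with capacity C_max
policy_net, target_net = QNetwork(), QNetwork()    # current value / target value networks
target_net.load_state_dict(policy_net.state_dict())
optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-4)
loss_fn = torch.nn.SmoothL1Loss()                  # Huber loss

def train_step(batch_size=64, gamma_discount=0.95):
    """Sample a mini-batch (s, a, s', r) from D and update the current value network.
    States are assumed to be stored as (1, 8, 8) tensors."""
    if len(replay) < batch_size:
        return
    s, a, s2, r = zip(*random.sample(replay, batch_size))
    s, s2 = torch.stack(s), torch.stack(s2)
    a = torch.tensor(a)
    r = torch.tensor(r, dtype=torch.float32)
    q = policy_net(s).gather(1, a.unsqueeze(1)).squeeze(1)             # Q(s, a)
    with torch.no_grad():
        q_target = r + gamma_discount * target_net(s2).max(1).values   # from Q(s', a')
    loss = loss_fn(q, q_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def maybe_sync_target(episode, n_tu=20):
    """Copy current value network parameters to the target value network."""
    if episode % n_tu == 0:                        # mod(episode, N_TU) == 0
        target_net.load_state_dict(policy_net.state_dict())
```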
in the above technical solution, the step 8 includes:
Step 8.1: adjust the γ point selected by the Deep Q-Learning network in step S7 as required to obtain the Deep Q-Learning area coverage algorithm under the obstacle region, and determine the γ point position adjustment strategy under the obstacle region. If a γ point is covered by an obstacle, coverage of that γ point need not be considered, and m_i(γ_x,y) = 1 is recorded. If an obstacle approaches a γ point and the distance between the obstacle and the γ point is smaller than the obstacle-avoidance distance d_o of the agent, the position of the γ point must be adjusted: the grid cell in which the γ point lies is taken as the optimization region of a new guide point γ_o, denoted M_obs, and the point γ_o is determined by the following equation:
where the set in the formula is the set of obstacles detected by agent i; if the first condition holds, m_i(γ) = 1 is set; if the second condition holds, a point of maximum coverage is selected according to the following equation:
where D_1 is the obstacle region within the optimization area M_obs, D_2 is the unobstructed region within the optimization area M_obs, γ'_x,y is a candidate point coordinate that can replace the γ point, and the selected point is the optimal replacement for the γ point;
Area coverage under the obstacle region is similar to that under the free region: when the distance between an obstacle and a γ point is too small, the agent achieves maximum coverage of the region by adjusting the position of the γ point. Therefore, once the Deep Q-Learning model under the free region has been trained, no separate model needs to be trained for the obstacle region;
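A heavily hedged sketch of the γ-point adjustment idea: candidate points inside the grid cell M_obs that keep at least the obstacle-avoidance distance d_o from every detected obstacle are searched, and the one with maximum clearance is kept. The candidate grid, the clearance criterion and all names are illustrative assumptions, since the patent's selection formulas are given only as figures:

```python
import numpy as np

def adjust_gamma_point(obstacles, cell_bounds, d_o, n_grid=10):
    """Search the grid cell M_obs for a replacement guide point.

    Returns a candidate at least d_o away from every detected obstacle
    (here: the one with maximum clearance), or None if the whole cell is
    blocked, in which case the caller records m_i(gamma) = 1.
    Assumes at least one obstacle has been detected.
    """
    (xmin, xmax), (ymin, ymax) = cell_bounds
    best, best_clearance = None, -np.inf
    for x in np.linspace(xmin, xmax, n_grid):
        for y in np.linspace(ymin, ymax, n_grid):
            cand = np.array([x, y])
            clearance = min(np.linalg.norm(cand - np.asarray(ob)) for ob in obstacles)
            if clearance >= d_o and clearance > best_clearance:
                best, best_clearance = cand, clearance
    return best
```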
step 8.2, a Deep Q-Learning based region coverage algorithm under the obstructed region comprises the following steps:
Step 8.2.1: initialize the control parameters of the agents (c_α, c_β, etc.), the sensor parameters (r_s, d, etc.) and the information map parameters (m, n, etc.);
step 8.2.2, initializing an obstacle area, and initializing an information map for all agents;
step 8.2.3, initializing the position and speed information of each agent;
step 8.2.4, initializing the state s and the behavior a and gamma points of each agent;
step 8.2.5, loading a network model which is trained and completed under a free area;
step 8.2.6, if the current round does not complete the coverage, executing:
step 8.2.6.1, traversing all agents i=1→n:
Step 8.2.6.2: the agent updates u_i, v_i and p_i according to the motion control model;
Step 8.2.6.3: update the information map; calculate the obtained reward r_i; construct the state s'_i and update the state s_i := s'_i;
Step 8.2.6.4: determine the behavior a_i according to the behavior selection strategy and convert it into the corresponding γ point;
Step 8.2.6.5: calculate the minimum obstacle-avoidance distance d_γo between the obstacles and the γ point;
Step 8.2.6.6: if d_γo < d_o, adjust the γ point position with the following steps, otherwise go to step 8.2.6.1;
Step 8.2.6.6.1: construct M_obs and calculate γ_o, the new guide point;
Step 8.2.6.6.2: if the replacement condition holds, obtain the new guide point, otherwise set m_i(γ) = 1.
Because the invention adopts the above technical scheme, it has the following beneficial effects:
the invention trains and learns the cluster area coverage control algorithm by means of Deep Q-Learning technology, realizes the cluster area coverage under the free area and the area with the obstacle, effectively improves the cluster area coverage efficiency, and can ensure the stability of the algorithm under the weak communication environment.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a schematic diagram of an observation state construction process;
Detailed Description
The technical solution of the present invention will be described in further detail with reference to the accompanying drawings, but the scope of the present invention is not limited to the following description.
As shown in fig. 1, a Deep Q-Learning based cluster area coverage method includes the following steps:
Step S1: establish a dynamics model of the cluster system. A cluster V contains n agents, V = {1, 2, ..., n}, and the i-th agent in the cluster is denoted agent i. The second-order dynamics model is defined as:
dp_i/dt = v_i,  dv_i/dt = u_i
where p_i is the position of agent i, v_i is the velocity of agent i, u_i is the acceleration of agent i, n is the total number of agents in the cluster, and dp_i/dt, dv_i/dt denote the derivatives of p_i and v_i with respect to time;
step S2, determining a neighbor set of the agents in the cluster, wherein when the distance between the two agents in the cluster is smaller than the communication distance, the two agents are considered to establish communication connection and share the position and the speed, and the neighbor set of the agent i is described as follows:
N_i = {j ∈ V : ||p_j - p_i|| ≤ r_α, j ≠ i}
where V represents the set of all agents, r_α is the communication distance between agents, ||·|| is the Euclidean norm, p_i is the position of agent i, and p_j is the position of agent j;
Step S3: establish a motion control model of the cluster system, where an α-agent denotes an agent, a β-agent denotes an obstacle detected by an agent, and a γ-agent denotes the destination of an agent's motion. Control terms are generated from the α-agent, β-agent and γ-agent respectively, and the total motion control input of agent i is calculated as their sum, where:
the α term ensures that the agents in the cluster do not collide with each other during motion;
the β term is the obstacle-avoidance control term when the agent moves in a space containing obstacles;
the γ term determines the movement direction of the agent;
Step S4: construct an information map and encode it;
Step S5: define the state space, behavior space and reward function required for reinforcement learning according to the information map;
Step S6: design the network model required by the Deep Q-Learning algorithm;
Step S7: design the Deep Q-Learning area coverage algorithm under the free region based on the results of steps S5 and S6;
Step S8: adjust the γ points obtained in step S7 as required to obtain the Deep Q-Learning area coverage algorithm under the obstacle region.
In the above technical solution, in step S3:
The α term ensures that the agents in the cluster do not collide with each other during motion and is defined as follows:
where c_α is a constant greater than zero; to ensure that the norm is differentiable everywhere, the σ-norm is defined:
Differentiating the σ-norm gives:
n_ij is defined as follows:
φ_α(z) is a potential energy function defined as follows:
φ_α(z) = ρ_h(z/r_α) φ(z - d_α)
where r_α = ||r_a||_σ, r_a represents the communication distance between agents; d_α = ||d||_σ, d represents the ideal spacing between agents;
φ(z) is defined as follows:
where the following constraints are satisfied among a, b, c:
ρ_h(z) is defined as follows:
ρ_h(z) is used in φ_α(z) to ensure the smoothness of the potential energy function; the potential energy is obtained by an integration process:
a_ij(p) is an element of the inter-agent adjacency matrix, defined as follows:
When an agent moves in a space containing obstacles, the obstacle-avoidance control term of agent i is defined as follows:
where c_β is a constant greater than zero, the set appearing in the formula is the set of obstacles detected by agent i, p_i,k and v_i,k denote the position and velocity information of obstacle k detected by agent i, and the potential energy function φ_β(z) is defined as follows:
where d_β = ||d_o||_σ and d_o is the ideal obstacle-avoidance distance of the agent;
The γ term determines the movement direction of the agent, as follows:
where the two gains are proportional and differential control parameters greater than zero, respectively, and p_γ is the position of the guide point γ;
In the above technical solution, step S4 comprises:
The area to be traversed is a rectangular region of size M×L. The region is quantized into a γ-information map of m×l grid cells, the centre of each cell corresponding to one guide point γ, so that a complete search of the region is converted into a complete traversal of the γ points in the information map. All γ points form the γ-information map set of agent i:
m_i(γ) = {γ_x,y}, x = 1, 2, ..., m, y = 1, 2, ..., l
where m and l are obtained from:
where r_s denotes the sensing radius of agent i; if agent i has traversed the position of guide point γ, m_i(γ) = 1 is recorded, otherwise m_i(γ) = 0;
Agent i fuses and updates its information map according to its own information map set and those of its neighbor agents; the update formula is defined as follows:
where m_i(γ_x,y) denotes the γ-information map of agent i, N_i is the set of neighbor agents of agent i, and m_s(γ_x,y) denotes the γ-information maps of the neighbor agents of agent i;
The γ-information map of the agent is encoded as follows: the information map is vectorized column by column, 8 binary values are taken out consecutively at a time and encoded as one hexadecimal value, and if fewer than 8 values remain at the end, the missing bits are padded with 0; after encoding, every 8 binary values correspond to 1 hexadecimal value, and when other agents receive the hexadecimal values they decode them by the inverse of the encoding process and restore the original information map;
In the above technical solution, step S5 comprises:
As shown in FIG. 2, define the state space required for reinforcement learning. For agent i, the state is constructed as follows: first, the information maps of agent i and its neighbor agents are fused according to the γ-information map update formula of step S4; second, a weight of 3 is assigned to the γ point at the position of agent i in the information map, and a weight of 2 is assigned to the γ points at the positions of all neighbor agents; finally, the fused information map is linearly stretched into a grayscale image with gray values from 0 to 255, i.e. 0 in the information map corresponds to gray value 0 and 3 corresponds to gray value 255;
Define the behavior space required for reinforcement learning. The behavior of an agent is represented by the selection of a target γ point; the agent's current position in the γ-map and the 8 selectable γ points around it are represented by the numbers 1 to 9, and the behavior space of agent i is defined as follows:
A_i = {1, 2, 3, 4, 5, 6, 7, 8, 9}
If agent i is located at the edge of the information map, its behavior space is a subset of A_i. To accelerate training, only uncovered γ points are taken as selectable behaviors of the agent during training, which yields the selectable behavior space A'_i, defined as follows:
A'_i = {γ_x,y ∈ A_i | m_i(γ_x,y) = 0}
According to the selected behavior, the agent takes the corresponding γ point as its target point, and the control input computed by the motion control model of step S3 drives the agent toward that target point. In practice, the agent is considered to have reached the γ point when ||p_i - γ_x,y|| < ε_d, where ε_d is the allowable distance error;
Define the reward function required for reinforcement learning:
where γ'_x,y is the next γ point selected by agent i, T is the time consumed to traverse the γ-information map, i.e. to complete the area coverage process, and R(T) is defined as follows:
where the constants involved are all positive, r_ref is the maximum reward, and T_min is the theoretical minimum coverage time of the area, defined as follows:
where M×L is the size of the target traversal region, m×l is the size of the γ-information map, and v_max is the maximum movement speed of the agent;
In the above technical solution, step S6 comprises:
Design the network model required by the Deep Q-Learning algorithm. To avoid losing feature information during convolution, the convolution kernel sizes are set to 3 or 1, the stride of all convolution kernels is set to 1, and a padding parameter is set so that the output feature map of every convolution layer has the same size as the initial image; to avoid losing image features during pooling, the network contains no pooling layers. Following these principles, the layers of the Q network are designed in sequence as follows: input of dimension 8×8×1; convolution layer 1 with output dimension 8×8×32 and kernel size 3×3; convolution layer 2 with output dimension 8×8×64 and kernel size 3×3; convolution layer 3 with output dimension 8×8×128 and kernel size 3×3; convolution layer 4 with output dimension 8×8×128 and kernel size 1×1; a fully connected layer of dimension 64×1; and an output layer of dimension 9×1;
the activation function selected in the training process is a ReLU function, the Loss function is Huber Loss, and the definition is as follows:
where s is the input state of the Q network, a is the behavior selected by the agent, Q(s, a) denotes the output of the current value network, and Q(s', a') denotes the output of the target value network;
in the above technical solution, the step 7 includes:
Step 7.1: design the Deep Q-Learning area coverage algorithm under the free region based on the results of steps S5 and S6, and determine the behavior selection strategy of the agent. According to whether the selectable behavior space A'_i defined in step S5 is an empty set, the behavior selection of agent i is divided into two cases:
Case 1: A'_i is not empty. To ensure that all states can be trained, the ε-greedy method is used to select behaviors, which are determined by the following formula:
where rand(1) denotes a random number drawn from (0, 1), Q_i(s_i, a_i) is the output of the target value network of agent i, the function f_sample denotes random sampling from the selectable behavior space A'_i, and ε is an exploration variable defined as follows:
where ε_start and ε_end denote the initial and final values of ε respectively, σ_ESP is a decay factor, and episode_num is the episode counter during training;
Case 2: A'_i is empty. Agent i cannot reach an uncovered area no matter which behavior it selects; in this case the uncovered γ point closest to the agent is selected, which gives a shortest path, and the behavior selection is defined as follows:
Step 7.2: the Deep Q-Learning based area coverage algorithm under the free region is given, in which the behavior a_i is determined according to the behavior selection strategy and converted into the corresponding γ point, and the parameters of the current value network are periodically copied to the target value network;
the Deep Q-Learning based region coverage algorithm under the free region comprises the following steps:
Step 7.2.1: initialize the Deep Q-Learning parameters λ, σ_ESP, r_ref, ε_start and ε_end, the control parameters of the agents, the sensor parameters r_s and d, the information map parameters m and n, the capacity C_max of the experience pool D, and the batch parameter batch_size;
Step 7.2.2: initialize an information map for all agents, initialize the current value and target value network models, and update the parameters of the target value network model with the following steps:
Step 7.2.2.1: for each episode = 1 → N_T, perform:
Step 7.2.2.1.1: initialize the position and velocity information of each agent;
Step 7.2.2.1.2: initialize the state s and behavior a of each agent;
Step 7.2.2.1.3: while the current round number is less than or equal to the total round number N_T, perform:
Step 7.2.2.1.3.1: traverse all agents i = 1 → n:
Step 7.2.2.1.3.2: calculate u_i, v_i and p_i according to the motion control model of the agent;
Step 7.2.2.1.3.3: update the information map; calculate the obtained reward r_i; construct the state s'_i;
Step 7.2.2.1.3.4: store the sample (s_i, a_i, s'_i, r_i) into the experience pool D;
Step 7.2.2.1.3.5: determine the behavior a_i according to the behavior selection strategy, convert a_i into the corresponding γ point, move the agent to the γ point according to u_i, v_i and p_i, and update the state s_i := s'_i after the movement is finished;
Step 7.2.2.1.3.6: if the number of samples in the experience pool is greater than batch_size, randomly select batch_size samples (s_i, a_i, s'_i, r_i) from D and train the current value network with them; otherwise go to step 7.2.2.1.3.2;
Step 7.2.2.2: if mod(episode, N_TU) = 0, copy the parameters of the current value network to the target value network, where N_TU is the update period;
in the above technical solution, the step 8 includes:
and (3) adjusting the gamma point obtained in the step (S7) as required to obtain a Deep Q-Learning region coverage algorithm under the obstacle region, and determining a gamma point position adjusting method: in the case that the gamma point is covered by the obstacle, the covering of the gamma point is not needed to be considered, and m is ix,y ) =1; for the situation that the obstacle approaches the gamma point, if the distance between the obstacle and the gamma point is smaller than the obstacle avoidance distance d of the intelligent body o The position of the gamma point needs to be adjusted, and the grid area where the gamma point is positioned is taken as a new guide point gamma o Is denoted as M obs ,γ o The point is determined by the following equation:
in the formula A set of obstacles detected for agent i; if->Let m be i (γ) =1; if->Then a point of maximum coverage is selected according to the following equation:
in the formula D1 Is the optimization area M obs Obstacle region D in (a) 2 Is the optimization area M obs In unobstructed areas, gamma' x,y Is a point coordinate that can replace the gamma point,is the optimal point for replacing gamma point;
the regional coverage under the obstacle region is similar to the regional coverage under the free region, when the distance between the obstacle and the gamma point is too short, the intelligent agent can realize the maximum coverage of the region by adjusting the position of the gamma point, so that if the Deep Q-Learning model under the free region is obtained through training, the Deep Q-Learning model under the obstacle region does not need to be trained;
based on the steps S1 to S8, a Deep Q-Learning based area coverage algorithm under the obstacle area is given as shown in table-2:
TABLE-2 Deep Q-Learning based region coverage algorithm under obstructed regions
/>
The invention realizes the training and Learning of the cluster region coverage control algorithm by means of Deep Q-Learning technology, realizes the cluster region coverage under the free region and the obstacle region, effectively improves the cluster region coverage efficiency, and can ensure the stability of the algorithm under the weak communication environment.
The foregoing is a preferred embodiment of the invention. It should be understood that the invention is not limited to the form disclosed herein and should not be construed as excluding other embodiments; it is capable of being used in various other combinations, modifications and environments, and is capable of changes within the scope of the inventive concept, whether as a result of the foregoing teachings or of the skill or knowledge of the relevant art. Modifications and variations that do not depart from the spirit and scope of the invention are intended to fall within the scope of the appended claims.

Claims (7)

1. The cluster area coverage method based on Deep Q-Learning is characterized by comprising the following steps of:
step S1: establish a dynamics model of the cluster system. A cluster V contains n agents, V = {1, 2, ..., n}, and the i-th agent in the cluster is denoted agent i. The second-order dynamics model is defined as:
dp_i/dt = v_i,  dv_i/dt = u_i
where p_i is the position of agent i, v_i is the velocity of agent i, u_i is the acceleration of agent i, n is the total number of agents in the cluster, and dp_i/dt, dv_i/dt denote the derivatives of p_i and v_i with respect to time;
step S2, determining a neighbor set of the agents in the cluster, wherein when the distance between the two agents in the cluster is smaller than the communication distance, the two agents are considered to establish communication connection and share the position and the speed, and the neighbor set of the agent i is described as follows:
N_i = {j ∈ V : ||p_j - p_i|| ≤ r_α, j ≠ i}
where V represents the set of all agents, r_α is the communication distance between agents, ||·|| is the Euclidean norm, p_i is the position of agent i, and p_j is the position of agent j;
step S3: establish a motion control model of the cluster system, where an α-agent denotes an agent, a β-agent denotes an obstacle detected by an agent, and a γ-agent denotes the destination of an agent's motion. Control terms are generated from the α-agent, β-agent and γ-agent respectively, and the total motion control input of agent i is calculated as their sum, where:
the α term ensures that the agents in the cluster do not collide with each other during motion;
the β term is the obstacle-avoidance control term when the agent moves in a space containing obstacles;
the γ term determines the movement direction of the agent;
step S4: quantize the area to be traversed into a γ-information map of m×l grid cells, where the centre of each cell corresponds to one guide point γ, so that a complete search of the area is converted into a complete traversal of the γ points in the information map; all γ points form the γ-information map set of agent i, and agent i fuses and updates its information map according to its own information map set and those of its neighbor agents, obtains the γ-information map of agent i, and encodes the information map;
step S5: define the state space, behavior space and reward function required for reinforcement learning according to the γ-information map;
step S6: design the network model required by the Deep Q-Learning algorithm;
step S7: design the Deep Q-Learning area coverage algorithm under the free region based on the results of steps S5 and S6, determine the behavior selection strategy of the agents, let the agents continuously interact with the environment through the behavior selection strategy and generate experience information, and train the Deep Q-Learning network with the experience information;
step S8: design a γ point position adjustment strategy for the obstacle region, and adjust the γ points selected by the Deep Q-Learning network in step S7 as required to obtain the Deep Q-Learning area coverage algorithm under the obstacle region.
2. The Deep Q-Learning based cluster area coverage method according to claim 1, wherein in step S3:
The α term ensures that the agents in the cluster do not collide with each other during motion and is defined as follows:
where c_α is a constant greater than zero; to ensure that the norm is differentiable everywhere, the σ-norm is defined:
Differentiating the σ-norm gives:
n_ij is defined as follows:
φ_α(z) is a potential energy function defined as follows:
φ_α(z) = ρ_h(z/r_α) φ(z - d_α)
where r_α = ||r_a||_σ, r_a represents the communication distance between agents; d_α = ||d||_σ, d represents the ideal spacing between agents;
φ(z) is defined as follows:
where the following constraints are satisfied among a, b, c:
ρ_h(z) is defined as follows:
ρ_h(z) is used in φ_α(z) to ensure the smoothness of the potential energy function; the potential energy is obtained by an integration process:
a_ij(p) is an element of the inter-agent adjacency matrix, defined as follows:
When an agent moves in a space containing obstacles, the obstacle-avoidance control term of agent i is defined as follows:
where c_β is a constant greater than zero, the set appearing in the formula is the set of obstacles detected by agent i, p_i,k and v_i,k denote the position and velocity information of obstacle k detected by agent i, and the potential energy function φ_β(z) is defined as follows:
where d_β = ||d_o||_σ and d_o is the ideal obstacle-avoidance distance of the agent;
The γ term determines the movement direction of the agent, as follows:
where the two gains are proportional and differential control parameters greater than zero, respectively, and p_γ is the position of the guide point γ.
3. The Deep Q-Learning based cluster area coverage method according to claim 1, wherein: the step 4 comprises the following steps:
The area to be traversed is a rectangular region of size M×L. The region is quantized into a γ-information map of m×l grid cells, the centre of each cell corresponding to one guide point γ, so that a complete search of the region is converted into a complete traversal of the γ points in the information map. All γ points form the γ-information map set of agent i:
m_i(γ) = {γ_x,y}, x = 1, 2, ..., m, y = 1, 2, ..., l
where m and l are obtained from:
where r_s denotes the sensing radius of agent i; if agent i has traversed the position of guide point γ, m_i(γ) = 1 is recorded, otherwise m_i(γ) = 0;
Agent i fuses and updates its information map according to its own information map set and those of its neighbor agents; the update formula is defined as follows:
where m_i(γ_x,y) denotes the γ-information map of agent i, N_i is the set of neighbor agents of agent i, and m_s(γ_x,y) denotes the γ-information maps of the neighbor agents of agent i;
The γ-information map of the agent is encoded as follows: the information map is vectorized column by column, 8 binary values are taken out consecutively at a time and encoded as one hexadecimal value, and if fewer than 8 values remain at the end, the missing bits are padded with 0; after encoding, every 8 binary values correspond to 1 hexadecimal value, and when other agents receive the hexadecimal values they decode them by the inverse of the encoding process and restore the original information map.
4. The Deep Q-Learning based cluster area coverage method according to claim 1, wherein: the step 5 comprises the following steps:
Define the state space required for reinforcement learning. For agent i, the state is constructed as follows: first, the information maps of agent i and its neighbor agents are fused according to the γ-information map update formula of step S4; second, a weight of 3 is assigned to the γ point at the position of agent i in the information map, and a weight of 2 is assigned to the γ points at the positions of all neighbor agents; finally, the fused information map is linearly stretched into a grayscale image with gray values from 0 to 255, i.e. 0 in the information map corresponds to gray value 0 and 3 corresponds to gray value 255;
Define the behavior space required for reinforcement learning. The behavior of an agent is represented by the selection of a target γ point; the agent's current position in the γ-map and the 8 selectable γ points around it are represented by the numbers 1 to 9, and the behavior space of agent i is defined as follows:
A_i = {1, 2, 3, 4, 5, 6, 7, 8, 9}
If agent i is located at the edge of the information map, its behavior space is a subset of A_i. To accelerate training, only uncovered γ points are taken as selectable behaviors of the agent during training, which yields the selectable behavior space A'_i, defined as follows:
A'_i = {γ_x,y ∈ A_i | m_i(γ_x,y) = 0}
According to the selected behavior, the agent takes the corresponding γ point as its target point, and the control input computed by the motion control model of step S3 drives the agent toward that target point. In practice, the agent is considered to have reached the γ point when ||p_i - γ_x,y|| < ε_d, where ε_d is the allowable distance error;
Define the reward function required for reinforcement learning:
where γ'_x,y is the next γ point selected by agent i, T is the time consumed to traverse the γ-information map, i.e. to complete the area coverage process, and R(T) is defined as follows:
where the constants involved are all positive, r_ref is the maximum reward, and T_min is the theoretical minimum coverage time of the area, defined as follows:
where M×L is the size of the target traversal region, m×l is the size of the γ-information map, and v_max is the maximum movement speed of the agent.
5. The Deep Q-Learning based cluster area coverage method according to claim 1, wherein: the step 6 comprises the following steps:
Design the network model required by the Deep Q-Learning algorithm. To avoid losing feature information during convolution, the convolution kernel sizes are set to 3 or 1, the stride of all convolution kernels is set to 1, and a padding parameter is set so that the output feature map of every convolution layer has the same size as the initial image; to avoid losing image features during pooling, the network contains no pooling layers. Following these principles, the layers of the Q network are designed in sequence as follows: input of dimension 8×8×1; convolution layer 1 with output dimension 8×8×32 and kernel size 3×3; convolution layer 2 with output dimension 8×8×64 and kernel size 3×3; convolution layer 3 with output dimension 8×8×128 and kernel size 3×3; convolution layer 4 with output dimension 8×8×128 and kernel size 1×1; a fully connected layer of dimension 64×1; and an output layer of dimension 9×1;
the activation function selected in the training process is a ReLU function, the Loss function is Huber Loss, and the definition is as follows:
where s is the input state of the Q network, a is the behavior selected by the agent, Q(s, a) denotes the output of the current value network, and Q(s', a') denotes the output of the target value network.
6. The Deep Q-Learning based cluster area coverage method according to claim 1, wherein:
Step 7.1: design the Deep Q-Learning area coverage algorithm under the free region based on the results of steps S5 and S6, and determine the behavior selection strategy of the agent. According to whether the selectable behavior space A'_i defined in step S5 is an empty set, the behavior selection of agent i is divided into two cases:
Case 1: A'_i is not empty. During training, to ensure that all states can be visited, the ε-greedy method is used to select behaviors, and the behavior selection is determined by the following formula:
where rand(1) denotes a random number drawn from (0, 1), Q_i(s_i, a_i) is the output of the target value network of agent i, the function f_sample denotes random sampling from the selectable behavior space A'_i, and ε is an exploration variable defined as follows:
where ε_start and ε_end denote the initial and final values of ε respectively, σ_ESP is a decay factor, episode_num is the episode counter during training, and ε is fixed at 0 during testing after training is complete;
Case 2: A'_i is empty. Agent i cannot reach an uncovered area no matter which behavior it selects; in this case the uncovered γ point closest to the agent is selected, which gives a shortest path, and the behavior selection is defined as follows:
step 7.2, giving a Deep Q-Learning-based region coverage algorithm under a free region, enabling an intelligent agent to continuously interact with the environment through a behavior selection strategy and generate experience information, and training a Deep Q-Learning network by using the experience information;
the Deep Q-Learning based region coverage algorithm under the free region comprises the following steps:
Step 7.2.1: initialize the Deep Q-Learning parameters λ, σ_ESP, r_ref, ε_start and ε_end, the control parameter c_α of the agent, the sensor parameters r_s and d, the information map parameters m and n, the capacity C_max of the experience pool D, the batch parameter batch_size, and the network parameter update period N_TU;
Step 7.2.2: initialize an information map for all agents, initialize the current value and target value network models, and update the parameters of the target value network model with the following steps:
Step 7.2.2.1: for each episode = 1 → N_T, perform:
Step 7.2.2.1.1: initialize the position and velocity information of each agent;
Step 7.2.2.1.2: initialize the state s, behavior a and γ point of each agent;
Step 7.2.2.1.3: while the current episode has not completed the coverage, execute:
Step 7.2.2.1.3.1: traverse all agents i = 1 → n:
Step 7.2.2.1.3.2: the agent updates u_i, v_i and p_i according to the motion control model;
Step 7.2.2.1.3.3: update the information map; calculate the obtained reward r_i; construct the state s'_i and update the state s_i := s'_i;
Step 7.2.2.1.3.4: store the sample (s_i, a_i, s'_i, r_i) into the experience pool D;
Step 7.2.2.1.3.5: determine the behavior a_i according to the behavior selection strategy and convert a_i into the corresponding γ point;
Step 7.2.2.1.3.6: if the number of samples in the experience pool is greater than batch_size, randomly select batch_size samples (s_i, a_i, s'_i, r_i) from D and train the current value network with them; otherwise go to step 7.2.2.1.3.1;
Step 7.2.2.2: if mod(episode, N_TU) = 0, copy the parameters of the current value network to the target value network.
7. The Deep Q-Learning based cluster area coverage method according to claim 1, wherein: the step 8 includes:
Step 8.1: adjust the γ point selected by the Deep Q-Learning network in step S7 as required to obtain the Deep Q-Learning area coverage algorithm under the obstacle area, and determine the γ point position adjustment strategy under the obstacle area. If the γ point is covered by an obstacle, its coverage need not be considered, and m_i(γ_{x,y}) = 1 is set. If an obstacle approaches the γ point and the distance between the obstacle and the γ point is smaller than the obstacle avoidance distance d_o of the agent, the position of the γ point must be adjusted: the grid area in which the γ point lies is taken as the optimization region of the new guide point γ_o, denoted M_obs, and γ_o is determined by the following equation:
where the set of obstacles detected by agent i appears in the equation; in the first case, let m_i(γ) = 1; in the second case, a point of maximum coverage is selected according to the following equation:
where D_1 is the obstacle region within the optimization area M_obs, D_2 is the unobstructed region within M_obs, γ′_{x,y} denotes a candidate point coordinate that can replace the γ point, and the maximizer is the optimal point replacing the γ point;
Region coverage under the obstacle region is similar to that under the free region: when the distance between an obstacle and the γ point is too small, the agent can still achieve maximum coverage of the region by adjusting the position of the γ point. Therefore, once the Deep Q-Learning model for the free region has been trained, no separate model needs to be trained for the obstacle region;
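A minimal sketch of this adjustment strategy, under the assumption of a grid-based optimization region and duck-typed env/agent helpers (detected_obstacles, is_obstacle, min_distance, cells_around, coverage_gain, info_map.mark_covered) that are not defined in the patent, might look as follows:

```python
def adjust_gamma_point(env, agent, gamma, d_o):
    """Sketch of the gamma point adjustment strategy of step 8.1.
    env and agent expose assumed helper methods; illustrative only."""
    obstacles = env.detected_obstacles(agent)          # obstacles seen by agent i
    if env.is_obstacle(gamma):                         # gamma lies inside an obstacle:
        agent.info_map.mark_covered(gamma)             # m_i(gamma_{x,y}) = 1, skip it
        return gamma
    if env.min_distance(gamma, obstacles) >= d_o:      # obstacle far enough away
        return gamma                                   # no adjustment needed
    # Obstacle closer than the avoidance distance d_o: search the grid cells
    # around gamma (optimization region M_obs) for a replacement guide point.
    m_obs = env.cells_around(gamma)
    candidates = [c for c in m_obs if not env.is_obstacle(c)]  # unobstructed region D_2
    if not candidates:                                 # no unobstructed cell available
        agent.info_map.mark_covered(gamma)             # m_i(gamma) = 1
        return gamma
    # Choose the candidate gamma'_{x,y} that maximizes the covered area (gamma*)
    return max(candidates, key=lambda c: env.coverage_gain(c, agent))
```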
Step 8.2: the Deep Q-Learning-based region coverage algorithm under the obstacle region comprises the following steps (an illustrative sketch follows the listing):
Step 8.2.1: initialize the control parameters c_a, c_β, etc. of the agent, the sensor parameters r_s, d, etc., and the information map parameters m, n, etc.;
Step 8.2.2: initialize the obstacle area and initialize an information map for all agents;
Step 8.2.3: initialize the position and velocity information of each agent;
Step 8.2.4: initialize the state s, behavior a, and γ point of each agent;
Step 8.2.5: load the network model trained under the free area;
Step 8.2.6: while coverage is not complete, execute:
Step 8.2.6.1: for each agent i = 1 → n:
Step 8.2.6.2: the agent updates u_i, v_i, and p_i according to the motion control model;
Step 8.2.6.3: update the information map; calculate the obtained return r_i; construct the state s′_i and update the state s_i := s′_i;
Step 8.2.6.4: determine the behavior a_i according to the behavior selection strategy and convert it into the corresponding γ point;
Step 8.2.6.5: calculate the minimum distance d_γo between the obstacles and the γ point;
Step 8.2.6.6: if d_γo < d_o, adjust the γ point position using the following steps; otherwise return to step 8.2.6.1;
Step 8.2.6.6.1: construct M_obs and calculate the new guide point γ_o;
Step 8.2.6.6.2: if a replacement point can be selected as in step 8.1, obtain the new replacement point for the γ point; otherwise let m_i(γ) = 1.
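For orientation, steps 8.2.1–8.2.6 amount to running the policy trained in the free region while applying the adjustment above online. A hedged sketch, reusing the assumed helpers of the previous fragments (the model file name and all method names are assumptions), is:

```python
def run_obstacle_region(env, agents, d_o):
    """Sketch of the obstacle-region run (steps 8.2.x): load the model trained
    in the free region, then act greedily while adjusting gamma points."""
    for agent in agents:
        agent.load_model("free_region_dqn.pt")          # assumed file name
    env.reset_positions_and_velocities(agents)
    env.reset_states_behaviors_gammas(agents)
    while not env.coverage_complete():
        for agent in agents:
            agent.update_motion()                        # u_i, v_i, p_i
            env.update_information_map(agent)
            agent.state = env.observe(agent)
            behavior = agent.select_behavior(training=False)   # epsilon held at 0
            gamma = env.behavior_to_gamma(agent, behavior)
            obstacles = env.detected_obstacles(agent)
            if env.min_distance(gamma, obstacles) < d_o:        # d_gamma_o < d_o
                gamma = adjust_gamma_point(env, agent, gamma, d_o)
            agent.gamma = gamma
```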
CN202210026133.0A 2022-01-11 2022-01-11 Deep Q-Learning-based cluster area coverage method Active CN114326749B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210026133.0A 2022-01-11 2022-01-11 Deep Q-Learning-based cluster area coverage method

Publications (2)

Publication Number Publication Date
CN114326749A (en) 2022-04-12
CN114326749B (en) 2023-10-13

Family

ID=81026417

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210026133.0A Active CN114326749B (en) 2022-01-11 2022-01-11 Deep Q-Learning-based cluster area coverage method

Country Status (1)

Country Link
CN (1) CN114326749B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2006072477A (en) * 2004-08-31 2006-03-16 Nippon Telegr & Teleph Corp <Ntt> Dialogue strategy learning method, program, and device, and storage medium
CN111880565A (en) * 2020-07-22 2020-11-03 电子科技大学 Q-Learning-based cluster cooperative countermeasure method
CN111880564A (en) * 2020-07-22 2020-11-03 电子科技大学 Multi-agent area searching method based on collaborative reinforcement learning
CN113110478A (en) * 2021-04-27 2021-07-13 广东工业大学 Method, system and storage medium for multi-robot motion planning
CN113156954A (en) * 2021-04-25 2021-07-23 电子科技大学 Multi-agent cluster obstacle avoidance method based on reinforcement learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2557674B (en) * 2016-12-15 2021-04-21 Samsung Electronics Co Ltd Automated Computer Power Management System, Apparatus and Methods

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Robot end-to-end control method based on deep Q-network learning; Zhang Haojie, Su Zhibao, Su Bo; Chinese Journal of Scientific Instrument (No. 10); full text *

Similar Documents

Publication Publication Date Title
Lamini et al. Genetic algorithm based approach for autonomous mobile robot path planning
CN110991972B (en) Cargo transportation system based on multi-agent reinforcement learning
CN110347155B (en) Intelligent vehicle automatic driving control method and system
CN112819253A (en) Unmanned aerial vehicle obstacle avoidance and path planning device and method
Hagras et al. Learning and adaptation of an intelligent mobile robot navigator operating in unstructured environment based on a novel online Fuzzy–Genetic system
CN111260026B (en) Navigation migration method based on meta reinforcement learning
CN111880564A (en) Multi-agent area searching method based on collaborative reinforcement learning
CN113299084B (en) Regional signal lamp cooperative control method based on multi-view coding migration reinforcement learning
CN116382267B (en) Robot dynamic obstacle avoidance method based on multi-mode pulse neural network
CN109540163A (en) A kind of obstacle-avoiding route planning algorithm combined based on differential evolution and fuzzy control
Janikow A genetic algorithm method for optimizing the fuzzy component of a fuzzy decision tree
Showalter et al. Neuromodulated multiobjective evolutionary neurocontrollers without speciation
Thabet et al. Sample-efficient deep reinforcement learning with imaginary rollouts for human-robot interaction
CN113110052A (en) Hybrid energy management method based on neural network and reinforcement learning
CN114326749B (en) Deep Q-Learning-based cluster area coverage method
CN116080688B (en) Brain-inspiring-like intelligent driving vision assisting method, device and storage medium
Lee et al. A genetic algorithm based robust learning credit assignment cerebellar model articulation controller
Showalter et al. Lamarckian inheritance in neuromodulated multiobjective evolutionary neurocontrollers
CN116300755A (en) Double-layer optimal scheduling method and device for heat storage-containing heating system based on MPC
CN109978133A (en) A kind of intensified learning moving method based on action mode
CN114859719A (en) Graph neural network-based reinforcement learning cluster bee-congestion control method
Showalter et al. Objective comparison and selection in mono-and multi-objective evolutionary neurocontrollers
Barto An approach to learning control surfaces by connectionist systems
Butz et al. REPRISE: A Retrospective and Prospective Inference Scheme.
CN115202339B (en) DQN-based multi-moon vehicle sampling fixed target self-adaptive planning method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant