CN111880565A - Q-Learning-based cluster cooperative countermeasure method - Google Patents
- Publication number
- CN111880565A (application CN202010710580.9A)
- Authority
- CN
- China
- Prior art keywords
- agent
- cluster
- group
- distance
- speed
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
- G05D1/10—Simultaneous control of position or course in three dimensions
- G05D1/101—Simultaneous control of position or course in three dimensions specially adapted for aircraft
- G05D1/104—Simultaneous control of position or course in three dimensions specially adapted for aircraft involving a plurality of aircrafts, e.g. formation flying
Landscapes
- Engineering & Computer Science (AREA)
- Aviation & Aerospace Engineering (AREA)
- Radar, Positioning & Navigation (AREA)
- Remote Sensing (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Automation & Control Theory (AREA)
- Feedback Control In General (AREA)
Abstract
The invention discloses a cluster cooperative countermeasure method based on Q-Learning, which comprises the following steps: giving the dynamic system of the agents in a cluster; determining the neighbor set of agents in a cluster; giving the motion process of two clusters in a mutually adversarial relationship; calculating the control force in the flocking algorithm; determining the obstacle avoidance mode; selecting an obstacle avoidance mode and defining the distance between clusters; determining the cluster speed; designing relative polar coordinates; designing a state space for cooperative driving; designing the behavior space of a cluster; designing a reward and penalty mechanism; and giving the Q-value table update function of the Q-learning algorithm. The invention trains and learns the cluster control algorithm by means of the Q-Learning technique, effectively improving cluster operation efficiency, maximizing benefit and ensuring the stability of the cluster.
Description
Technical Field
The invention belongs to the field of multi-agent clustering and Q-Learning, and particularly relates to a Q-Learning-based cluster cooperative countermeasure method.
Background
Group confrontation is a common natural phenomenon, for example, shark populations preying on schools of fish in the ocean and carnivores hunting their prey. In recent years, with the rise of artificial intelligence, intelligent control has become a popular research field, and significant progress has been made with agents such as unmanned aerial vehicles, unmanned vehicles and mobile robots. An agent is an individual with autonomous behavior and perception capabilities; correspondingly, a multi-agent system is a system composed of multiple agents that can jointly complete a given task.
The existing clustering technologies mainly fall into two types: cluster formation control and cluster search control.
Cluster formation control steers each agent to move along a preset route in a preset formation to complete a set task, while maintaining stability and robustness among the agents, for example, formation displays performed by an unmanned-aerial-vehicle swarm. Cluster search control steers the agents to search an area to be detected, maximizing the searched area in the shortest time. However, current clustering technologies remain insufficient in terms of operation efficiency and benefit-maximization design.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a cluster cooperative countermeasure method based on Q-Learning, which can effectively improve the cluster operation efficiency, maximize benefits and ensure the cluster stability.
The purpose of the invention is realized by the following technical scheme: a cluster cooperative countermeasure method based on Q-Learning comprises the following steps:
S1, describing the dynamic system of an agent in the cluster as the second-order integrator:

$$\dot{p}_i = v_i, \qquad \dot{v}_i = u_i, \qquad i = 1, \ldots, n$$

where $p_i$ is the position of the ith agent in the cluster, $v_i$ is the speed of the ith agent in the cluster, $u_i$ is the acceleration of the ith agent in the cluster and serves as the control input, and $n$ is the total number of agents in the cluster; $\dot{p}_i$ and $\dot{v}_i$ denote the derivatives of $p_i$ and $v_i$;
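For illustration, a minimal discrete-time sketch of this second-order integrator (explicit Euler, with an assumed step size dt) could look as follows:

```python
import numpy as np

def step_agent(p, v, u, dt=0.1):
    """One explicit-Euler step of the second-order integrator
    p_dot = v, v_dot = u; p, v, u are 3-vectors in 3-D space (claim 2)."""
    return p + dt * v, v + dt * u

# example: one agent accelerating along the x-axis from rest
p, v = np.zeros(3), np.zeros(3)
p, v = step_agent(p, v, u=np.array([1.0, 0.0, 0.0]))
```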
S2, when the distance between two agents in the cluster is smaller than the communication distance, the two agents are considered connected and share their position and speed; the neighbor set of the ith agent in the cluster is described as:

$$N_i^a = \{\, j \in V : \|p_j - p_i\| \le r,\ j \ne i \,\}$$

where $V$ denotes the set of agents, $r$ denotes the communication distance between agents, and $\|\cdot\|$ is the Euclidean norm;
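A direct sketch of the neighbor-set computation above; the position array and the communication radius r are assumed inputs:

```python
import numpy as np

def neighbor_set(i, positions, r):
    """N_i^a = { j in V : ||p_j - p_i|| <= r, j != i } for the ith agent;
    `positions` is an (n, 3) array of agent positions."""
    p_i = positions[i]
    return [j for j in range(len(positions))
            if j != i and np.linalg.norm(positions[j] - p_i) <= r]
```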
S3, setting two clusters in a mutually adversarial relationship, where the agents in the first cluster are x_agents and the agents in the second cluster are y_agents; the y_agents keep the cluster stable while evading capture by the x_agents, and the x_agents make autonomous decisions; let $p_i^x$, $v_i^x$ and $u_i^x$ denote the position, speed and control input of the ith x_agent, and likewise let $p_i^y$, $v_i^y$ and $u_i^y$ denote the position, speed and control input of the ith y_agent;

the motion process of the ith x_agent is described by:

$$\dot{p}_i^x = v_i^x, \qquad \dot{v}_i^x = u_i^x = f_e(\hat{v}_i^x, v_i^x), \qquad \hat{v}_i^x = f_{QL}(s_i)$$

where $\dot{p}_i^x$ and $\dot{v}_i^x$ denote the derivatives of $p_i^x$ and $v_i^x$, $f_{QL}(\cdot)$ is the implicit expression of QL, $s_i$ is the state variable of QL, QL denotes Q-Learning, $\hat{v}_i^x$ is the desired speed, and $f_e(\cdot)$ is a speed control function; $\hat{v}_i^x$ is also referred to as the desired attack speed, and if its magnitude is constant, specifying the attack speed is equivalent to specifying the attack direction; to reduce the number of learning states of the x_agent and speed up the training of the algorithm, the heading direction is discretized;
let the group of x_agents be the x_group and the group of y_agents be the y_group; while evading the x_group, the direction of a y_agent is mainly determined by the attack directions of the x_agents, and the same discrete quantization applied to the attack direction of the x_agent is applied to the avoidance direction of the y_agent, so as to match the generation of the Q-learning states of the x_agent; the flocking algorithm takes the avoidance speed as input to obtain the control input of the y_agent; the process of the ith y_agent is described as:

$$\hat{v}_i^y = f_a(P_x, V_x, p_i^y, v_i^y), \qquad u_i^y = f_F(\hat{v}_i^y)$$

where $f_a(\cdot)$ denotes the y_agent avoidance algorithm, whose inputs $P_x$ and $V_x$ are the detected positions and speeds of the x_agents, $p_i^y$ is the position of the ith y_agent and $v_i^y$ its speed; the output $\hat{v}_i^y$ is the desired escape speed, and $f_F(\cdot)$ is the implicit expression of the flocking algorithm;
S4, in the flocking algorithm, let α-agent denote a y_agent, β-agent denote an x_agent, and γ-agent denote the movement destination of the y_agents; $u_i^\alpha$, $u_i^\beta$ and $u_i^\gamma$ are the control forces generated by the α-agent, β-agent and γ-agent terms respectively, and the total control force is:

$$u_i = u_i^\alpha + u_i^\beta + u_i^\gamma$$

where $u_i^\alpha$ ensures the stability of the internal topology of the cluster, $u_i^\beta$ realizes the avoidance behavior of the y-agent, and $u_i^\gamma$ determines the movement direction of the y-agent;
S5, determining the obstacle avoidance mode:

first, the x-agent is within the detection range $r_o$ of the y-agent but outside the obstacle avoidance range $d_o$ of the y-agent: since the distance is still large, the y-group cluster can complete collective obstacle avoidance without breaking the internal topology of the cluster into individual avoidance; in this mode all y-agents share the same destination, and the method proceeds to step S6;

second, the x-agent is within the obstacle avoidance range of the y-agent: the distance is too short, and if the collective mode were kept, the x-agents and y-agents would be likely to collide; the collective obstacle avoidance mode therefore fails and individual obstacle avoidance is adopted; in this mode, since the force exerted by the x-agents on each y-agent in the cluster differs, the y-agents no longer share exactly the same movement direction and the original topology may break apart; avoidance is then performed according to the formula of S4, in which $u_i^\alpha$ ensures the stability of the topology inside the y-group cluster, $u_i^\beta$ makes the y-agent avoid the x-group, and $u_i^\gamma$ determines the movement direction of the y-agent, which is perpendicular to the movement direction of the x-agent;
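The two avoidance modes can be summarized in a short selector sketch; the threshold names r_o and d_o follow S5, and d_xy stands for the inter-group distance defined next in S6:

```python
def avoidance_mode(d_xy, r_o, d_o):
    """Select the obstacle-avoidance mode of the y-group (S5): collective
    avoidance keeps the cluster topology; individual avoidance is forced
    once an x-agent enters the avoidance range d_o."""
    if d_xy <= d_o:
        return "individual"   # case 2: too close, topology may break
    if d_xy <= r_o:
        return "collective"   # case 1: detected but still far enough
    return "none"             # no x-agent detected yet
```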
S6, defining the distance between the x-group and the y-group as the minimum inter-group distance:

$$d_{xy} = \min_{i,j} \|p_i^x - p_j^y\|$$

where $p_j^y$ is the position of the jth y-agent and $\min(\cdot)$ is the minimum function; the basic idea of cluster obstacle avoidance is as follows: after the y-group cluster detects the x-group, each y-agent selects a movement direction perpendicular to the detected movement direction of the x-agent and computes its destination from the selected direction; when one y-agent detects several x-agents, the selected movement direction is the weighted vector sum of the movement directions of those x-agents, where each weight represents the threat degree of the corresponding x-agent and is determined by the distance between that x-agent and the y-agent;
S7, if the y-group cluster detects only one x-agent with speed $v^x$, the evasion speed selected by the y-group cluster is $\hat{v}^y$; with $e^x$ and $e^y$ the unit vectors of $v^x$ and $\hat{v}^y$ respectively, the evasion direction is perpendicular to the attack direction, i.e. $e^y \perp e^x$;

if k x-agents are detected, the cluster speed is the weighted sum of the individual evasion speeds:

$$\hat{v}^y = \sum_{k} w_{yk}\, \hat{v}_k^y$$

where $w_{yk}$, the threat degree of the kth x-agent, is computed from the distance between that x-agent and the y-group and is normalized by the factor η;
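A sketch of the evasion-speed computation of S6–S7 for the planar case; the perpendicular evasion direction follows the text, while the inverse-distance threat weights are an assumption standing in for the patent's weight formula, which is not reproduced in the published text:

```python
import numpy as np

def evasion_velocity(x_velocities, x_distances, speed=1.0):
    """Evasion speed of the y-group (S6-S7 sketch, planar case): each
    detected x-agent contributes a direction perpendicular to its own
    motion; the weights w_yk are modeled here as normalized inverse
    distances (an assumption), the normalization playing the role of
    the factor eta."""
    perp = [np.array([-v[1], v[0]]) / np.linalg.norm(v)
            for v in x_velocities]          # in-plane perpendiculars
    w = np.array([1.0 / d for d in x_distances])
    w /= w.sum()                            # normalization (eta)
    v_hat = sum(wk * pk for wk, pk in zip(w, perp))
    return speed * v_hat / np.linalg.norm(v_hat)
```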
S8, designing relative polar coordinates:

evenly dividing the angle of the whole polar-coordinate plane into 32 parts gives the angle space

$$\mathrm{Ang} = \{0, \pi/16, 2\pi/16, \ldots, 31\pi/16\}$$

and, according to the detection distance $r_a$ of the x-agent, the detection distance $r_o$ of the y-agent and the obstacle avoidance distance $d_o$ of the y-agent, the distance is divided into 4 intervals, each representing one distance state of the space Dis, where $r_a$ and $d_o$ satisfy

$$d_o < d_o + \Delta < r_a$$

and Δ is an offset that is small relative to $r_a$, i.e. Δ ≪ $r_a$;
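A sketch of how a relative polar coordinate could be mapped to the discrete spaces Ang and Dis; the published distance table is an image, so the split at d_o, d_o + Δ and r_a below is an assumption:

```python
import numpy as np

def discretize_state(theta, d, d_o, r_a, delta):
    """Map a relative polar coordinate (theta, d) onto the discrete
    spaces of S8: 32 angle bins Ang = {0, pi/16, ..., 31*pi/16} and
    4 distance states split at d_o, d_o + delta and r_a."""
    ang_idx = int(round((theta % (2 * np.pi)) / (np.pi / 16))) % 32
    if d <= d_o:
        dis_idx = 0        # inside the avoidance range
    elif d <= d_o + delta:
        dis_idx = 1        # just outside the avoidance range
    elif d <= r_a:
        dis_idx = 2        # within detection range
    else:
        dis_idx = 3        # beyond detection range
    return ang_idx, dis_idx
```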
S9, designing the state space for cooperative driving: the x-group driving the y-group is a cooperative process, so the state of the learning algorithm is determined by the motion states of the x-agent and the nearby y-agents; to realize the cooperative driving mode, the state space is designed as:

$$s_i = [\theta_y, \theta_1, d_1, \theta_2, d_2, \ldots, \theta_k, d_k]$$

which is the state expression when k y-agents are detected; $\theta_y$ is the angular deviation of the y-group cluster from the destination, and $\theta_i$ and $d_i$ are the included angle and the distance between the x-agent and the ith detected y-agent, with θ ∈ Ang and d ∈ Dis;
S10, designing the behavior space of the x-agent:

$$A_i = [1, 2, 3, \ldots, 32, 33]$$

where 1 to 32 represent the different attack directions, in one-to-one correspondence with the angle space of S8 (1 represents 0, 2 represents π/16, 3 represents 2π/16, ..., 32 represents 31π/16), and 33 represents the x-agent remaining static with speed 0;
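A small decoder from an action index in A_i to a velocity command, following the one-to-one correspondence above (the constant speed magnitude is an assumed parameter):

```python
import numpy as np

def action_to_velocity(a, speed=1.0):
    """Decode an action index a in A_i = [1, ..., 33] (S10): indices
    1..32 map one-to-one onto the angle space Ang; index 33 means the
    x-agent stays static with speed 0."""
    if a == 33:
        return np.zeros(2)
    theta = (a - 1) * np.pi / 16
    return speed * np.array([np.cos(theta), np.sin(theta)])
```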
S11, designing the reward and penalty mechanism:

in the reward formula, $d_{ty}$ denotes the distance of the y-group from the destination and $\epsilon_y$ is the allowed error; when the distance between an x-agent and a y-agent is smaller than $d_o$, a negative reward is given, because the aim is to drive the y-group to the destination while maintaining the cluster topology of the y-group, so distances smaller than $d_o$ are not expected to occur; when the movement direction of the y-group points toward the destination, which is the expected behavior, a positive reward is given; when the y-group reaches the destination, the whole episode ends and a large reward is given, indicating the correctness of the strategy over the whole episode;
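The published reward formula itself is an image; a sketch consistent with the three cases just described, with assumed reward magnitudes, could be:

```python
def reward(d_ty, d_min_xy, heading_to_dest, d_o, eps_y):
    """Reward sketch for S11; only the sign structure follows the text,
    the magnitudes are assumptions: penalize an x-agent closing inside
    d_o, reward the y-group heading toward the destination, and give a
    large terminal reward once the y-group is within eps_y of it."""
    if d_ty <= eps_y:
        return 100.0       # episode ends: large positive return
    r = 0.0
    if d_min_xy < d_o:
        r -= 1.0           # closer than d_o: negative reward
    if heading_to_dest:
        r += 1.0           # y-group moving toward the destination
    return r
```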
S12, the Q-learning algorithm is a value-based algorithm among reinforcement learning algorithms, where Q = Q(s, a) is the expected return obtained by taking action a in state s at a given time, with s ∈ S, S denoting the state set, a ∈ A, and A denoting the action set;

the environment feeds back a corresponding reward r according to the action of the agent; the main idea of the algorithm is to build a Q-table indexed by state and action to store the Q values, and then select the action that yields the maximum benefit according to the Q values; the Q-value table update function is:

$$Q_{k+1}(s_i, a_i) = Q_k(s_i, a_i) + \alpha\left[r_i + \gamma \max_{a_i'} Q_k(s_i', a_i') - Q_k(s_i, a_i)\right]$$

where k denotes the kth training iteration, α is the learning rate, γ is the discount factor, $a_i'$ denotes the next action, $s_i'$ is the next state, and $r_i$ is the reward value.
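A tabular implementation of this update rule; the state keys are assumed to be hashable tuples such as those produced by the S8/S9 discretization:

```python
import numpy as np
from collections import defaultdict

# Q-table over (state, action); 33 discrete actions as in S10
Q = defaultdict(lambda: np.zeros(33))

def q_update(s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Tabular Q-learning update:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    td_target = r + gamma * Q[s_next].max()
    Q[s][a] += alpha * (td_target - Q[s][a])
```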
Preferably, in three-dimensional space, the variables $p_i$, $v_i$ and $u_i$ in step S1 all have dimension 3.
Preferably, in step S4, $u_i^\alpha$ ensures the stability of the internal topology of the cluster, $u_i^\beta$ realizes obstacle avoidance, and $u_i^\gamma$ determines the movement direction toward the γ-agent, in the following way:

$$u_i^\gamma = -c_1^\gamma (p_i - p_r) - c_2^\gamma (v_i - v_r)$$

where $c_1^\gamma$ and $c_2^\gamma$ are positive constants, the preset control coefficients, and $p_r$ and $v_r$ are the position and direction of the destination; furthermore

$$\phi_\alpha(z) = \rho_h(z / r_\alpha)\, \phi(z - d_\alpha)$$

where $\rho_h(z)$ in $\phi_\alpha(z)$ is designed to ensure the smoothness of the potential-energy function, the potential energy being obtained by an integration process;
to ensure that the norm is differentiable everywhere, we define a sigma norm:
differentiating the sigma norm
In particular when 1, there are:
from the definition of the sigma norm, the following parameters are obtained:
rα=||ra||σ
dα=||d||σ
nij=σ(pj-pi)。
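A sketch of the σ-norm and its gradient as defined above (the standard flocking definitions, which the surrounding parameters $r_\alpha = \|r_a\|_\sigma$ and $n_{ij} = \sigma_\epsilon(p_j - p_i)$ appear to use):

```python
import numpy as np

def sigma_norm(z, eps=0.1):
    """Sigma norm ||z||_sigma = (sqrt(1 + eps*||z||^2) - 1) / eps;
    unlike the Euclidean norm it is differentiable at z = 0."""
    return (np.sqrt(1.0 + eps * np.dot(z, z)) - 1.0) / eps

def sigma_grad(z, eps=0.1):
    """Gradient of the sigma norm, sigma_eps(z) = z / sqrt(1 + eps*||z||^2);
    n_ij = sigma_grad(p_j - p_i) as in the parameter list above."""
    return z / np.sqrt(1.0 + eps * np.dot(z, z))
```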
the invention has the beneficial effects that: the invention trains and learns the cluster control algorithm by means of the Q-Learning technology, effectively improves the cluster operation efficiency, maximizes benefits and ensures the stability of the cluster.
Drawings
FIG. 1 is a flow chart of a method of the present invention;
FIG. 2 is a schematic diagram illustrating a process of avoiding x-agent by y-agent in the embodiment;
FIG. 3 is a schematic diagram of the division of relative discretized polar coordinates;
FIG. 4 is a flow chart of a simulation process.
Detailed Description
The technical solutions of the present invention are further described in detail below with reference to the accompanying drawings, but the scope of the present invention is not limited to the following.
As shown in fig. 1, a cluster cooperative countermeasure method based on Q-Learning includes the following steps:
S1, describing the dynamic system of an agent in the cluster as the second-order integrator:

$$\dot{p}_i = v_i, \qquad \dot{v}_i = u_i, \qquad i = 1, \ldots, n$$

where $p_i$ is the position of the ith agent in the cluster, $v_i$ is the speed of the ith agent in the cluster, $u_i$ is the acceleration of the ith agent in the cluster and serves as the control input, and $n$ is the total number of agents in the cluster; $\dot{p}_i$ and $\dot{v}_i$ denote the derivatives of $p_i$ and $v_i$;
S2, when the distance between two agents in the cluster is smaller than the communication distance, the two agents are considered connected and share their position and speed; the neighbor set of the ith agent in the cluster is described as:

$$N_i^a = \{\, j \in V : \|p_j - p_i\| \le r,\ j \ne i \,\}$$

where $V$ denotes the set of agents, $r$ denotes the communication distance between agents, and $\|\cdot\|$ is the Euclidean norm;
S3, setting two clusters in a mutually adversarial relationship, where the agents in the first cluster are x_agents and the agents in the second cluster are y_agents; the y_agents keep the cluster stable while evading capture by the x_agents, and the x_agents make autonomous decisions; let $p_i^x$, $v_i^x$ and $u_i^x$ denote the position, speed and control input of the ith x_agent, and likewise let $p_i^y$, $v_i^y$ and $u_i^y$ denote the position, speed and control input of the ith y_agent;

the motion process of the ith x_agent is described by:

$$\dot{p}_i^x = v_i^x, \qquad \dot{v}_i^x = u_i^x = f_e(\hat{v}_i^x, v_i^x), \qquad \hat{v}_i^x = f_{QL}(s_i)$$

where $\dot{p}_i^x$ and $\dot{v}_i^x$ denote the derivatives of $p_i^x$ and $v_i^x$, $f_{QL}(\cdot)$ is the implicit expression of QL, $s_i$ is the state variable of QL, QL denotes Q-Learning, $\hat{v}_i^x$ is the desired speed, and $f_e(\cdot)$ is a speed control function; $\hat{v}_i^x$ is also referred to as the desired attack speed, and if its magnitude is constant, specifying the attack speed is equivalent to specifying the attack direction; to reduce the number of learning states of the x_agent and speed up the training of the algorithm, the heading direction is discretized;
let the group of x_agents be the x_group and the group of y_agents be the y_group; while evading the x_group, the direction of a y_agent is mainly determined by the attack directions of the x_agents, and the same discrete quantization applied to the attack direction of the x_agent is applied to the avoidance direction of the y_agent, so as to match the generation of the Q-learning states of the x_agent; the flocking algorithm takes the avoidance speed as input to obtain the control input of the y_agent; the process of the ith y_agent is described as:

$$\hat{v}_i^y = f_a(P_x, V_x, p_i^y, v_i^y), \qquad u_i^y = f_F(\hat{v}_i^y)$$

where $f_a(\cdot)$ denotes the y_agent avoidance algorithm, whose inputs $P_x$ and $V_x$ are the detected positions and speeds of the x_agents, $p_i^y$ is the position of the ith y_agent and $v_i^y$ its speed; the output $\hat{v}_i^y$ is the desired escape speed, and $f_F(\cdot)$ is the implicit expression of the flocking algorithm;
S4, in the flocking algorithm, let α-agent denote a y_agent, β-agent denote an x_agent, and γ-agent denote the movement destination of the y_agents; $u_i^\alpha$, $u_i^\beta$ and $u_i^\gamma$ are the control forces generated by the α-agent, β-agent and γ-agent terms respectively, and the total control force is:

$$u_i = u_i^\alpha + u_i^\beta + u_i^\gamma$$

where $u_i^\alpha$ ensures the stability of the internal topology of the cluster, $u_i^\beta$ realizes the avoidance behavior of the y-agent, and $u_i^\gamma$ determines the movement direction of the y-agent;
S5, determining the obstacle avoidance mode:

first, the x-agent is within the detection range $r_o$ of the y-agent but outside the obstacle avoidance range $d_o$ of the y-agent: since the distance is still large, the y-group cluster can complete collective obstacle avoidance without breaking the internal topology of the cluster into individual avoidance; in this mode all y-agents share the same destination, and the method proceeds to step S6;

second, the x-agent is within the obstacle avoidance range of the y-agent: the distance is too short, and if the collective mode were kept, the x-agents and y-agents would be likely to collide; the collective obstacle avoidance mode therefore fails and individual obstacle avoidance is adopted; in this mode, since the force exerted by the x-agents on each y-agent in the cluster differs, the y-agents no longer share exactly the same movement direction and the original topology may break apart; avoidance is then performed according to the formula of S4, in which $u_i^\alpha$ ensures the stability of the topology inside the y-group cluster, $u_i^\beta$ makes the y-agent avoid the x-group, and $u_i^\gamma$ determines the movement direction of the y-agent, which is perpendicular to the movement direction of the x-agent;
S6, defining the distance between the x-group and the y-group as the minimum inter-group distance:

$$d_{xy} = \min_{i,j} \|p_i^x - p_j^y\|$$

where $p_j^y$ is the position of the jth y-agent and $\min(\cdot)$ is the minimum function; the basic idea of cluster obstacle avoidance is as follows: after the y-group cluster detects the x-group, each y-agent selects a movement direction perpendicular to the detected movement direction of the x-agent and computes its destination from the selected direction; when one y-agent detects several x-agents, the selected movement direction is the weighted vector sum of the movement directions of those x-agents, where each weight represents the threat degree of the corresponding x-agent and is determined by the distance between that x-agent and the y-agent;
S7, if the y-group cluster detects only one x-agent with speed $v^x$, the evasion speed selected by the y-group cluster is $\hat{v}^y$; with $e^x$ and $e^y$ the unit vectors of $v^x$ and $\hat{v}^y$ respectively, the evasion direction is perpendicular to the attack direction, i.e. $e^y \perp e^x$;

if k x-agents are detected, the cluster speed is the weighted sum of the individual evasion speeds:

$$\hat{v}^y = \sum_{k} w_{yk}\, \hat{v}_k^y$$

where $w_{yk}$, the threat degree of the kth x-agent, is computed from the distance between that x-agent and the y-group and is normalized by the factor η; FIG. 2 illustrates the process of the y-agents evading the x-agent.
S8, designing relative polar coordinates: as shown in FIG. 3, evenly dividing the angle of the whole polar-coordinate plane into 32 parts gives the angle space

$$\mathrm{Ang} = \{0, \pi/16, 2\pi/16, \ldots, 31\pi/16\}$$

and, according to the detection distance $r_a$ of the x-agent, the detection distance $r_o$ of the y-agent and the obstacle avoidance distance $d_o$ of the y-agent, the distance is divided into 4 intervals, each representing one distance state of the space Dis, where $r_a$ and $d_o$ satisfy

$$d_o < d_o + \Delta < r_a$$

and Δ is an offset that is small relative to $r_a$, i.e. Δ ≪ $r_a$;
S9, designing the state space for cooperative driving: the x-group driving the y-group is a cooperative process, so the state of the learning algorithm is determined by the motion states of the x-agent and the nearby y-agents; to realize the cooperative driving mode, the state space is designed as:

$$s_i = [\theta_y, \theta_1, d_1, \theta_2, d_2, \ldots, \theta_k, d_k]$$

which is the state expression when k y-agents are detected; $\theta_y$ is the angular deviation of the y-group cluster from the destination, and $\theta_i$ and $d_i$ are the included angle and the distance between the x-agent and the ith detected y-agent, with θ ∈ Ang and d ∈ Dis;
S10, designing the behavior space of the x-agent:

$$A_i = [1, 2, 3, \ldots, 32, 33]$$

where 1 to 32 represent the different attack directions, in one-to-one correspondence with the angle space of S8 (1 represents 0, 2 represents π/16, 3 represents 2π/16, ..., 32 represents 31π/16), and 33 represents the x-agent remaining static with speed 0;
S11, designing the reward and penalty mechanism:

in the reward formula, $d_{ty}$ denotes the distance of the y-group from the destination and $\epsilon_y$ is the allowed error; when the distance between an x-agent and a y-agent is smaller than $d_o$, a negative reward is given, because the aim is to drive the y-group to the destination while maintaining the cluster topology of the y-group, so distances smaller than $d_o$ are not expected to occur; when the movement direction of the y-group points toward the destination, which is the expected behavior, a positive reward is given; when the y-group reaches the destination, the whole episode ends and a large reward is given, indicating the correctness of the strategy over the whole episode;
S12, the Q-learning algorithm is a value-based algorithm among reinforcement learning algorithms, where Q = Q(s, a) is the expected return obtained by taking action a in state s at a given time, with s ∈ S, S denoting the state set, a ∈ A, and A denoting the action set;

the environment feeds back a corresponding reward r according to the action of the agent; the main idea of the algorithm is to build a Q-table indexed by state and action to store the Q values, and then select the action that yields the maximum benefit according to the Q values; the Q-value table update function is:

$$Q_{k+1}(s_i, a_i) = Q_k(s_i, a_i) + \alpha\left[r_i + \gamma \max_{a_i'} Q_k(s_i', a_i') - Q_k(s_i, a_i)\right]$$

where k denotes the kth training iteration, α is the learning rate, γ is the discount factor, $a_i'$ denotes the next action, $s_i'$ is the next state, and $r_i$ is the reward value.
In three-dimensional space, the variables $p_i$, $v_i$ and $u_i$ in step S1 all have dimension 3.
In step S4, $u_i^\alpha$ ensures the stability of the internal topology of the cluster, $u_i^\beta$ realizes obstacle avoidance, and $u_i^\gamma$ determines the movement direction toward the γ-agent, in the following way:

$$u_i^\gamma = -c_1^\gamma (p_i - p_r) - c_2^\gamma (v_i - v_r)$$

where $c_1^\gamma$ and $c_2^\gamma$ are positive constants, the preset control coefficients, and $p_r$ and $v_r$ are the position and direction of the destination; furthermore

$$\phi_\alpha(z) = \rho_h(z / r_\alpha)\, \phi(z - d_\alpha)$$

where $\rho_h(z)$ in $\phi_\alpha(z)$ is designed to ensure the smoothness of the potential-energy function, the potential energy being obtained by an integration process;
to ensure that the norm is differentiable everywhere, we define a sigma norm:
differentiating the sigma norm
In particular when 1, there are:
from the definition of the sigma norm, the following parameters are obtained:
rα=||ra||σ
dα=||d||σ
nij=σ(pj-pi)。
In the embodiment of the application, the basic parameters of the simulation environment are set as follows: the simulation follows the flow shown in FIG. 4; according to the size of the coordinate area, the location of the destination, the numbers of x-agents and y-agents and the sensing ranges, the state space and a structure array Q are created, training runs until all states are finished, and the training results are stored in Q. During training the exploration rate is gradually reduced, so that the traversed Q values become more accurate; using the trained Q values in simulation, the process of mutual confrontation between the two agent clusters can be realized in this environment, and the resulting process is globally optimal.
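A training-loop sketch matching the flow of FIG. 4; the simulator interface env.reset()/env.step() and the decay schedule of the exploration rate are assumptions:

```python
import numpy as np

def train(env, Q, episodes=5000, alpha=0.1, gamma=0.9,
          eps_start=1.0, eps_min=0.05, eps_decay=0.999):
    """Epsilon-greedy Q-learning loop; `env` is a hypothetical simulator
    with reset() -> state and step(a) -> (state, reward, done); actions
    are indexed 0..32 here, corresponding to A_i = [1, ..., 33]."""
    eps = eps_start
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            if np.random.rand() < eps:
                a = np.random.randint(33)     # explore
            else:
                a = int(np.argmax(Q[s]))      # exploit
            s_next, r, done = env.step(a)
            Q[s][a] += alpha * (r + gamma * Q[s_next].max() - Q[s][a])
            s = s_next
        eps = max(eps_min, eps * eps_decay)   # decay the exploration rate
    return Q
```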
The foregoing is a preferred embodiment of the present invention. It is to be understood that the invention is not limited to the form disclosed herein, is not to be construed as excluding other embodiments, and is capable of other combinations, modifications and environments, and of changes within the scope of the inventive concept described herein, commensurate with the above teachings or with the skill or knowledge of the relevant art. Modifications and variations may be effected by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.
Claims (3)
1. A cluster cooperative countermeasure method based on Q-Learning is characterized in that: the method comprises the following steps:
S1, describing the dynamic system of an agent in the cluster as the second-order integrator:

$$\dot{p}_i = v_i, \qquad \dot{v}_i = u_i, \qquad i = 1, \ldots, n$$

where $p_i$ is the position of the ith agent in the cluster, $v_i$ is the speed of the ith agent in the cluster, $u_i$ is the acceleration of the ith agent in the cluster and serves as the control input, and $n$ is the total number of agents in the cluster; $\dot{p}_i$ and $\dot{v}_i$ denote the derivatives of $p_i$ and $v_i$;
S2, when the distance between two agents in the cluster is smaller than the communication distance, the two agents are considered connected and share their position and speed; the neighbor set of the ith agent in the cluster is described as:

$$N_i^a = \{\, j \in V : \|p_j - p_i\| \le r,\ j \ne i \,\}$$

where $V$ denotes the set of agents, $r$ denotes the communication distance between agents, and $\|\cdot\|$ is the Euclidean norm;
S3, setting two clusters in a mutually adversarial relationship, where the agents in the first cluster are x_agents and the agents in the second cluster are y_agents; the y_agents keep the cluster stable while evading capture by the x_agents, and the x_agents make autonomous decisions; let $p_i^x$, $v_i^x$ and $u_i^x$ denote the position, speed and control input of the ith x_agent, and likewise let $p_i^y$, $v_i^y$ and $u_i^y$ denote the position, speed and control input of the ith y_agent;

the motion process of the ith x_agent is described by:

$$\dot{p}_i^x = v_i^x, \qquad \dot{v}_i^x = u_i^x = f_e(\hat{v}_i^x, v_i^x), \qquad \hat{v}_i^x = f_{QL}(s_i)$$

where $\dot{p}_i^x$ and $\dot{v}_i^x$ denote the derivatives of $p_i^x$ and $v_i^x$, $f_{QL}(\cdot)$ is the implicit expression of QL, $s_i$ is the state variable of QL, QL denotes Q-Learning, $\hat{v}_i^x$ is the desired speed, and $f_e(\cdot)$ is a speed control function; $\hat{v}_i^x$ is also referred to as the desired attack speed, and if its magnitude is constant, specifying the attack speed is equivalent to specifying the attack direction; to reduce the number of learning states of the x_agent and speed up the training of the algorithm, the heading direction is discretized;
let the group of x_agents be the x_group and the group of y_agents be the y_group; while evading the x_group, the direction of a y_agent is mainly determined by the attack directions of the x_agents, and the same discrete quantization applied to the attack direction of the x_agent is applied to the avoidance direction of the y_agent, so as to match the generation of the Q-learning states of the x_agent; the flocking algorithm takes the avoidance speed as input to obtain the control input of the y_agent; the process of the ith y_agent is described as:

$$\hat{v}_i^y = f_a(P_x, V_x, p_i^y, v_i^y), \qquad u_i^y = f_F(\hat{v}_i^y)$$

where $f_a(\cdot)$ denotes the y_agent avoidance algorithm, whose inputs $P_x$ and $V_x$ are the detected positions and speeds of the x_agents, $p_i^y$ is the position of the ith y_agent and $v_i^y$ its speed; the output $\hat{v}_i^y$ is the desired escape speed, and $f_F(\cdot)$ is the implicit expression of the flocking algorithm;
S4, in the flocking algorithm, let α-agent denote a y_agent, β-agent denote an x_agent, and γ-agent denote the movement destination of the y_agents; $u_i^\alpha$, $u_i^\beta$ and $u_i^\gamma$ are the control forces generated by the α-agent, β-agent and γ-agent terms respectively, and the total control force is:

$$u_i = u_i^\alpha + u_i^\beta + u_i^\gamma$$

where $u_i^\alpha$ ensures the stability of the internal topology of the cluster, $u_i^\beta$ realizes the avoidance behavior of the y-agent, and $u_i^\gamma$ determines the movement direction of the y-agent;
S5, determining the obstacle avoidance mode:

first, the x-agent is within the detection range $r_o$ of the y-agent but outside the obstacle avoidance range $d_o$ of the y-agent: since the distance is still large, the y-group cluster can complete collective obstacle avoidance without breaking the internal topology of the cluster into individual avoidance; in this mode all y-agents share the same destination, and the method proceeds to step S6;

second, the x-agent is within the obstacle avoidance range of the y-agent: the distance is too short, and if the collective mode were kept, the x-agents and y-agents would be likely to collide; the collective obstacle avoidance mode therefore fails and individual obstacle avoidance is adopted; in this mode, since the force exerted by the x-agents on each y-agent in the cluster differs, the y-agents no longer share exactly the same movement direction and the original topology may break apart; avoidance is then performed according to the formula of S4, in which $u_i^\alpha$ ensures the stability of the topology inside the y-group cluster, $u_i^\beta$ makes the y-agent avoid the x-group, and $u_i^\gamma$ determines the movement direction of the y-agent, which is perpendicular to the movement direction of the x-agent;
S6, defining the distance between the x-group and the y-group as the minimum inter-group distance:

$$d_{xy} = \min_{i,j} \|p_i^x - p_j^y\|$$

where $p_j^y$ is the position of the jth y-agent and $\min(\cdot)$ is the minimum function; the basic idea of cluster obstacle avoidance is as follows: after the y-group cluster detects the x-group, each y-agent selects a movement direction perpendicular to the detected movement direction of the x-agent and computes its destination from the selected direction; when one y-agent detects several x-agents, the selected movement direction is the weighted vector sum of the movement directions of those x-agents, where each weight represents the threat degree of the corresponding x-agent and is determined by the distance between that x-agent and the y-agent;
S7, if the y-group cluster detects only one x-agent with speed $v^x$, the evasion speed selected by the y-group cluster is $\hat{v}^y$; with $e^x$ and $e^y$ the unit vectors of $v^x$ and $\hat{v}^y$ respectively, the evasion direction is perpendicular to the attack direction, i.e. $e^y \perp e^x$;

if k x-agents are detected, the cluster speed is the weighted sum of the individual evasion speeds:

$$\hat{v}^y = \sum_{k} w_{yk}\, \hat{v}_k^y$$

where $w_{yk}$, the threat degree of the kth x-agent, is computed from the distance between that x-agent and the y-group and is normalized by the factor η;
S8, designing relative polar coordinates:

evenly dividing the angle of the whole polar-coordinate plane into 32 parts gives the angle space

$$\mathrm{Ang} = \{0, \pi/16, 2\pi/16, \ldots, 31\pi/16\}$$

and, according to the detection distance $r_a$ of the x-agent, the detection distance $r_o$ of the y-agent and the obstacle avoidance distance $d_o$ of the y-agent, the distance is divided into 4 intervals, each representing one distance state of the space Dis, where $r_a$ and $d_o$ satisfy

$$d_o < d_o + \Delta < r_a$$

and Δ is an offset that is small relative to $r_a$, i.e. Δ ≪ $r_a$;
S9, designing the state space for cooperative driving: the x-group driving the y-group is a cooperative process, so the state of the learning algorithm is determined by the motion states of the x-agent and the nearby y-agents; to realize the cooperative driving mode, the state space is designed as:

$$s_i = [\theta_y, \theta_1, d_1, \theta_2, d_2, \ldots, \theta_k, d_k]$$

which is the state expression when k y-agents are detected; $\theta_y$ is the angular deviation of the y-group cluster from the destination, and $\theta_i$ and $d_i$ are the included angle and the distance between the x-agent and the ith detected y-agent, with θ ∈ Ang and d ∈ Dis;
S10, designing the behavior space of the x-agent:

$$A_i = [1, 2, 3, \ldots, 32, 33]$$

where 1 to 32 represent the different attack directions, in one-to-one correspondence with the angle space of S8 (1 represents 0, 2 represents π/16, 3 represents 2π/16, ..., 32 represents 31π/16), and 33 represents the x-agent remaining static with speed 0;
S11, designing the reward and penalty mechanism:

in the reward formula, $d_{ty}$ denotes the distance of the y-group from the destination and $\epsilon_y$ is the allowed error; when the distance between an x-agent and a y-agent is smaller than $d_o$, a negative reward is given, because the aim is to drive the y-group to the destination while maintaining the cluster topology of the y-group, so distances smaller than $d_o$ are not expected to occur; when the movement direction of the y-group points toward the destination, which is the expected behavior, a positive reward is given; when the y-group reaches the destination, the whole episode ends and a large reward is given, indicating the correctness of the strategy over the whole episode;
S12, the Q-learning algorithm is a value-based algorithm among reinforcement learning algorithms, where Q = Q(s, a) is the expected return obtained by taking action a in state s at a given time, with s ∈ S, S denoting the state set, a ∈ A, and A denoting the action set;

the environment feeds back a corresponding reward r according to the action of the agent; the main idea of the algorithm is to build a Q-table indexed by state and action to store the Q values, and then select the action that yields the maximum benefit according to the Q values; the Q-value table update function is:

$$Q_{k+1}(s_i, a_i) = Q_k(s_i, a_i) + \alpha\left[r_i + \gamma \max_{a_i'} Q_k(s_i', a_i') - Q_k(s_i, a_i)\right]$$

where k denotes the kth training iteration, α is the learning rate, γ is the discount factor, $a_i'$ denotes the next action, $s_i'$ is the next state, and $r_i$ is the reward value.
2. The Q-Learning based cluster cooperative countermeasure method of claim 1, wherein: in three-dimensional space, the variables $p_i$, $v_i$ and $u_i$ in step S1 all have dimension 3.
3. The Q-Learning based cluster cooperative countermeasure method of claim 1, wherein: in step S4, $u_i^\alpha$ ensures the stability of the internal topology of the cluster, $u_i^\beta$ realizes obstacle avoidance, and $u_i^\gamma$ determines the movement direction toward the γ-agent, in the following way:

$$u_i^\gamma = -c_1^\gamma (p_i - p_r) - c_2^\gamma (v_i - v_r)$$

where $c_1^\gamma$ and $c_2^\gamma$ are positive constants, the preset control coefficients, and $p_r$ and $v_r$ are the position and direction of the destination; furthermore

$$\phi_\alpha(z) = \rho_h(z / r_\alpha)\, \phi(z - d_\alpha)$$

where $\rho_h(z)$ in $\phi_\alpha(z)$ is designed to ensure the smoothness of the potential-energy function, the potential energy being obtained by an integration process;
to ensure that the norm is differentiable everywhere, we define a sigma norm:
differentiating the sigma norm
In particular when 1, there are:
from the definition of the sigma norm, the following parameters are obtained:
rα=||ra||σ
dα=||d||σ
nij=σ(pj-pi)。
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010710580.9A CN111880565A (en) | 2020-07-22 | 2020-07-22 | Q-Learning-based cluster cooperative countermeasure method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010710580.9A CN111880565A (en) | 2020-07-22 | 2020-07-22 | Q-Learning-based cluster cooperative countermeasure method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111880565A true CN111880565A (en) | 2020-11-03 |
Family
ID=73155229
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010710580.9A Pending CN111880565A (en) | 2020-07-22 | 2020-07-22 | Q-Learning-based cluster cooperative countermeasure method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111880565A (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112947581A (en) * | 2021-03-25 | 2021-06-11 | 西北工业大学 | Multi-unmanned aerial vehicle collaborative air combat maneuver decision method based on multi-agent reinforcement learning |
CN113033756A (en) * | 2021-03-25 | 2021-06-25 | 重庆大学 | Multi-agent control method based on target-oriented aggregation strategy |
CN113156954A (en) * | 2021-04-25 | 2021-07-23 | 电子科技大学 | Multi-agent cluster obstacle avoidance method based on reinforcement learning |
CN113359824A (en) * | 2021-05-31 | 2021-09-07 | 杭州电子科技大学 | Unmanned aerial vehicle cluster control method based on fuzzy model |
CN113885576A (en) * | 2021-10-29 | 2022-01-04 | 南京航空航天大学 | Unmanned aerial vehicle formation environment establishment and control method based on deep reinforcement learning |
CN114326749A (en) * | 2022-01-11 | 2022-04-12 | 电子科技大学长三角研究院(衢州) | Deep Q-Learning-based cluster area coverage method |
- 2020-07-22: application CN202010710580.9A filed in China (CN); published as CN111880565A; status: Pending
Non-Patent Citations (1)
Title |
---|
Xiao Jian, "Research on Flocking Cluster Cooperative Control Algorithms Based on Reinforcement Learning", China Master's Theses Full-text Database, Information Science and Technology Series * |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112947581A (en) * | 2021-03-25 | 2021-06-11 | 西北工业大学 | Multi-unmanned aerial vehicle collaborative air combat maneuver decision method based on multi-agent reinforcement learning |
CN113033756A (en) * | 2021-03-25 | 2021-06-25 | 重庆大学 | Multi-agent control method based on target-oriented aggregation strategy |
CN112947581B (en) * | 2021-03-25 | 2022-07-05 | 西北工业大学 | Multi-unmanned aerial vehicle collaborative air combat maneuver decision method based on multi-agent reinforcement learning |
CN113033756B (en) * | 2021-03-25 | 2022-09-16 | 重庆大学 | Multi-agent control method based on target-oriented aggregation strategy |
CN113156954A (en) * | 2021-04-25 | 2021-07-23 | 电子科技大学 | Multi-agent cluster obstacle avoidance method based on reinforcement learning |
CN113359824A (en) * | 2021-05-31 | 2021-09-07 | 杭州电子科技大学 | Unmanned aerial vehicle cluster control method based on fuzzy model |
CN113885576A (en) * | 2021-10-29 | 2022-01-04 | 南京航空航天大学 | Unmanned aerial vehicle formation environment establishment and control method based on deep reinforcement learning |
CN114326749A (en) * | 2022-01-11 | 2022-04-12 | 电子科技大学长三角研究院(衢州) | Deep Q-Learning-based cluster area coverage method |
CN114326749B (en) * | 2022-01-11 | 2023-10-13 | 电子科技大学长三角研究院(衢州) | Deep Q-Learning-based cluster area coverage method |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111880565A (en) | Q-Learning-based cluster cooperative countermeasure method | |
CN110632931B (en) | Mobile robot collision avoidance planning method based on deep reinforcement learning in dynamic environment | |
CN110703766B (en) | Unmanned aerial vehicle path planning method based on transfer learning strategy deep Q network | |
Zhu et al. | Task assignment and path planning of a multi-AUV system based on a Glasius bio-inspired self-organising map algorithm | |
CN111780777A (en) | Unmanned vehicle route planning method based on improved A-star algorithm and deep reinforcement learning | |
Wang et al. | A survey of underwater search for multi-target using Multi-AUV: Task allocation, path planning, and formation control | |
CN110347181B (en) | Energy consumption-based distributed formation control method for unmanned aerial vehicles | |
CN111176122B (en) | Underwater robot parameter self-adaptive backstepping control method based on double BP neural network Q learning technology | |
CN111240345A (en) | Underwater robot trajectory tracking method based on double BP network reinforcement learning framework | |
Cao et al. | Hunting algorithm for multi-auv based on dynamic prediction of target trajectory in 3d underwater environment | |
CN108919818B (en) | Spacecraft attitude orbit collaborative planning method based on chaotic population variation PIO | |
CN109947131A (en) | A kind of underwater multi-robot formation control method based on intensified learning | |
CN115993781B (en) | Network attack resistant unmanned cluster system cooperative control method, terminal and storage medium | |
CN109540163A (en) | A kind of obstacle-avoiding route planning algorithm combined based on differential evolution and fuzzy control | |
CN113156954A (en) | Multi-agent cluster obstacle avoidance method based on reinforcement learning | |
CN114089776A (en) | Unmanned aerial vehicle obstacle avoidance method based on deep reinforcement learning | |
CN114518770A (en) | Unmanned aerial vehicle path planning method integrating potential field and deep reinforcement learning | |
CN114003059A (en) | UAV path planning method based on deep reinforcement learning under kinematic constraint condition | |
CN113759935B (en) | Intelligent group formation mobile control method based on fuzzy logic | |
Yan et al. | Flocking and collision avoidance for a dynamic squad of fixed-wing UAVs using deep reinforcement learning | |
Nabizadeh et al. | A multi-swarm cellular PSO based on clonal selection algorithm in dynamic environments | |
Chen et al. | A multirobot cooperative area coverage search algorithm based on bioinspired neural network in unknown environments | |
CN112306097A (en) | Novel unmanned aerial vehicle path planning method | |
CN116954258A (en) | Hierarchical control method and device for multi-four-rotor unmanned aerial vehicle formation under unknown disturbance | |
CN115542921A (en) | Autonomous path planning method for multiple robots |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20201103 |