CN111880565A - Q-Learning-based cluster cooperative countermeasure method - Google Patents

Q-Learning-based cluster cooperative countermeasure method

Info

Publication number
CN111880565A
Authority
CN
China
Prior art keywords
agent
cluster
group
distance
speed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010710580.9A
Other languages
Chinese (zh)
Inventor
王刚
肖剑
薛玉玺
黄治宇
田新宇
孙奇
成雷
王钰瑶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Priority to CN202010710580.9A
Publication of CN111880565A
Pending legal-status: Current

Classifications

    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05D SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00 Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/10 Simultaneous control of position or course in three dimensions
    • G05D1/101 Simultaneous control of position or course in three dimensions specially adapted for aircraft
    • G05D1/104 Simultaneous control of position or course in three dimensions specially adapted for aircraft involving a plurality of aircrafts, e.g. formation flying

Landscapes

  • Engineering & Computer Science (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention discloses a Q-Learning-based cluster cooperative countermeasure method, which comprises the following steps: giving the dynamic system of the agents in a cluster; determining the neighbor set of each agent in the cluster; describing the motion process of two clusters that are in a mutually adversarial relationship; calculating the control force in the flocking algorithm; determining the obstacle avoidance modes; selecting an obstacle avoidance mode and defining the distance between the clusters; determining the cluster velocity; designing relative polar coordinates; designing a state space for cooperative driving; designing the behavior space of the cluster; designing a reward and penalty mechanism; and, for the Q-Learning algorithm, giving its Q-value table update function. The invention trains the cluster control algorithm by means of Q-Learning, which effectively improves cluster operation efficiency, maximizes benefit and ensures the stability of the cluster.

Description

Q-Learning-based cluster cooperative countermeasure method
Technical Field
The invention belongs to the field of multi-agent clustering and Q-Learning, and particularly relates to a Q-Learning-based cluster cooperative countermeasure method.
Background
Cluster confrontation is a common phenomenon, for example the predation of other fish by shark groups in the ocean, or of herbivores by carnivores. In recent years, with the rise of artificial intelligence, intelligent control has become a popular research field, and significant progress has been made with intelligent agents such as unmanned aerial vehicles, unmanned vehicles and mobile robots. An agent is an individual with autonomous behavior and perception capabilities; correspondingly, a multi-agent system is a system composed of a plurality of agents that can jointly complete a given task.
Existing cluster technologies fall mainly into two categories: cluster formation control and cluster search control.
Cluster formation control drives each agent to move along a preset route and in a preset formation to complete a set task, while keeping the stability and robustness of the cluster during the process, for example a formation performance by an unmanned aerial vehicle swarm. Cluster search control drives the agents to search an area to be explored so as to maximize the searched area in the shortest time. However, current cluster technologies are still insufficient in terms of operation efficiency and benefit-maximization design.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a cluster cooperative countermeasure method based on Q-Learning, which can effectively improve the cluster operation efficiency, maximize benefits and ensure the cluster stability.
The purpose of the invention is realized by the following technical scheme: a cluster cooperative countermeasure method based on Q-Learning comprises the following steps:
S1, describing the dynamic system of an agent in the cluster as a second-order integrator system:

dpi/dt = vi,  dvi/dt = ui,  i = 1, 2, ..., n

wherein pi is the location of the ith agent in the cluster, vi is the speed of the ith agent in the cluster, ui is the acceleration of the ith agent in the cluster and serves as the control input, and n is the total number of agents in the cluster; dpi/dt and dvi/dt denote the derivatives of pi and vi;
S2, when the distance between two agents in the cluster is smaller than the communication distance, the two agents are considered connected and share their position and speed; the neighbor set of the ith agent in the cluster is described as:

Ni^a = {j ∈ V : ||pj - pi|| ≤ r, j ≠ i};

wherein V denotes the set of agents, r denotes the communication distance between agents, and ||·|| is the Euclidean norm;
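For illustration only (not part of the claimed method), the following Python sketch shows the second-order integrator of S1 and the neighbor-set computation of S2; the class name, the function name and the forward-Euler time step are assumptions introduced for the example:

import numpy as np

class Agent:
    # Second-order integrator: dp/dt = v, dv/dt = u
    def __init__(self, p, v):
        self.p = np.asarray(p, dtype=float)   # position p_i
        self.v = np.asarray(v, dtype=float)   # velocity v_i

    def step(self, u, dt):
        # advance one time step with control input (acceleration) u
        self.v = self.v + dt * np.asarray(u, dtype=float)
        self.p = self.p + dt * self.v

def neighbor_set(P, i, r):
    # N_i^a = {j : ||p_j - p_i|| <= r, j != i} for positions P (one row per agent)
    d = np.linalg.norm(P - P[i], axis=1)
    return [j for j in range(len(P)) if j != i and d[j] <= r]

# usage: one integration step and one neighbor query
a = Agent(p=[0.0, 0.0], v=[1.0, 0.0])
a.step(u=[0.5, 0.0], dt=0.1)
P = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 0.0]])
print(a.p, a.v, neighbor_set(P, 0, r=2.0))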
s3, two clusters with a mutual counter relationship are set, an agent in the first cluster is an x _ agent, an agent in the second cluster is a y _ agent, the stability of the clusters is still kept when the y _ agent avoids the catching process of the x _ agent, and the x _ agent makes an autonomous decision;
let pi^x, vi^x and ui^x respectively denote the position, speed and control input of the ith x_agent; in the same way, let pi^y, vi^y and ui^y respectively denote the position, speed and control input of the ith y_agent;
the motion process of the ith x _ agent is described by the following equation:
dpi^x/dt = vi^x,  dvi^x/dt = ui^x = fe(vi^x, vi^xe),  vi^xe = fQL(si)

wherein dpi^x/dt and dvi^x/dt denote the derivatives of pi^x and vi^x, fQL(·) is the implicit expression of QL, si is the state variable of QL, QL denotes Q-Learning, vi^xe is the desired speed, and fe(·) is a speed control function; vi^xe is also referred to as the desired attack speed; if the magnitude of vi^xe is constant, the attack speed is equivalent to the attack direction; in order to reduce the number of learning states of the x_agent and accelerate the training of the algorithm, the heading direction needs to be discretized;
let us assume that the group of x _ agent is x _ group and the group of y _ agent is y _ group, and during the process of avoiding x _ group, the direction of y _ agent is mainly determined by the attack direction of x _ agent, and the same discrete quantization operation is performed on both the avoidance direction of y _ agent and the attack direction of x _ agent in order to match the generation of Q learning state of x _ agent; the packing algorithm takes the avoidance speed as input to obtain the control input of y _ agent; the process of the ith y _ agent is described as follows:
Figure BDA0002596375220000029
wherein f isa(. -) represents the y-agent avoidance algorithm, input PxAnd VxIs the detected x-agent's position and velocity,
Figure BDA00025963752200000215
indicating the location of the ith y-agent,
Figure BDA00025963752200000210
speed, output quantity of the ith y-agent
Figure BDA00025963752200000211
Is the desired escape speed, fF(.) is an implicit expression of the flooding algorithm;
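The speed control function fe(·) and the avoidance algorithm fa(·) are left implicit above; purely as an assumption, a common concrete choice for fe(·) is a saturated proportional velocity-tracking law that turns the desired attack or escape velocity into an acceleration command, as sketched below (the gain and saturation values are illustrative):

import numpy as np

def velocity_controller(v, v_desired, k=2.0, u_max=5.0):
    # assumed form of f_e: u = k * (v_desired - v), saturated at u_max;
    # the text only states that f_e maps the current and desired velocity to the control input
    u = k * (np.asarray(v_desired, dtype=float) - np.asarray(v, dtype=float))
    n = np.linalg.norm(u)
    return u if n <= u_max else u * (u_max / n)

print(velocity_controller(v=[0.0, 0.0], v_desired=[1.0, 1.0]))   # -> [2. 2.]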
s4, in the packing algorithm, setting alpha-agent to represent the y _ agent of the intelligent agent, beta-agent to represent the x-agent of the intelligent agent, and gamma-agent to represent the moving destination of the y-agent of the intelligent agent; generated from alpha-agent, beta-agent, gamma-agent, respectively
Figure BDA00025963752200000212
Figure BDA0002596375220000031
The total control force is calculated as follows:
Figure BDA0002596375220000032
Figure BDA0002596375220000033
for ensuring the stability of the internal topology of the cluster,
Figure BDA0002596375220000034
the avoidance of the y-agent is realized,
Figure BDA0002596375220000035
determining the movement direction of the y-agent;
s5, determining an obstacle avoidance mode:
the first detection range r of the x-agent in the y-agent0Within but not within the obstacle avoidance range d of the y-agent0In the method, because the distance is too far, the y-group clusters can complete collective obstacle avoidance without destroying the internal topological structures of the clusters to carry out respective obstacle avoidance, and in the obstacle avoidance mode, the y-agents have the same destination, and then the method goes to step S6;
secondly, the x-agent is in the obstacle avoidance range of the y-agent, and due to the fact that the distance is too short, if a collective obstacle avoidance mode is continuously adopted, the x-agent and the y-agent are likely to collide with each other; therefore, at this moment, the collective obstacle avoidance mode fails, and respective obstacle avoidance modes are adopted, in this mode, because the acting force of the x-agent on each y-agent in the cluster is different, the y-agents do not completely have the same movement direction, the original topological structure can be cracked, at this moment, the obstacle avoidance is carried out according to the formula definition of S4, according to the formula of S4,
Figure BDA0002596375220000036
for ensuring the stability of the topology inside the y-group cluster,
Figure BDA0002596375220000037
the y-agent can avoid the x-group,
Figure BDA0002596375220000038
determining the movement direction of the y-agent, which is the direction vertical to the movement direction of the x-agent;
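The choice between the two obstacle avoidance modes of S5 reduces to comparing the x-agent's distance with the thresholds ro and do; a minimal sketch (the function name and the return labels are illustrative):

import numpy as np

def avoidance_mode(p_x, p_y, r_o, d_o):
    # 'collective': x-agent detected but outside the avoidance range (go to S6);
    # 'individual': x-agent inside the avoidance range, topology may break (forces of S4);
    # 'none': x-agent not detected at all
    d = np.linalg.norm(np.asarray(p_x, dtype=float) - np.asarray(p_y, dtype=float))
    if d <= d_o:
        return "individual"
    if d <= r_o:
        return "collective"
    return "none"

print(avoidance_mode([3.0, 0.0], [0.0, 0.0], r_o=5.0, d_o=1.5))   # -> collective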
s6, defining the distance between the x-group and the y-group as follows:
Figure BDA0002596375220000039
wherein
Figure BDA00025963752200000310
For the jth y-agent, min () represents the minimum function; the basic idea of cluster obstacle avoidance is as follows: after the y-group cluster detects the x-group; the y-agent will select the x-agent perpendicular to the x-agent based on the detected direction of movement of the x-agentMoving in the direction of the moving direction, and calculating the destination according to the selected moving direction; when one y-agent detects a plurality of x-agents, the selected motion direction is the weighted sum of the vectors of the motion directions of the plurality of x-agents, the weight value of the motion direction represents the threat degree of the x-group and is determined by the distance between the x-agent and the y-agent;
s7, if the y-group cluster only detects one x-agent, the speed is
Figure BDA00025963752200000311
The evasion speed selected by the y-group cluster is
Figure BDA00025963752200000312
And
Figure BDA00025963752200000313
are respectively as
Figure BDA00025963752200000314
And
Figure BDA00025963752200000315
the unit vector of (2) has:
Figure BDA0002596375220000041
if k x-agents are detected, the cluster velocity is given by:
Figure BDA0002596375220000042
wherein, wykThe threat level for x-agent is calculated by the following formula:
Figure BDA0002596375220000043
wherein η is a normalization factor;
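A sketch of the evasion rule of S6 and S7: each detected x-agent contributes a direction perpendicular to its own motion, and the contributions are combined with distance-dependent weights; the inverse-distance weighting used here is an assumption, since the text only states that the weight is determined by the x-agent/y-agent distance and normalized by η:

import numpy as np

def perpendicular(v):
    # unit vector perpendicular to a 2-D velocity v (rotation by +90 degrees)
    v = np.asarray(v, dtype=float)
    w = np.array([-v[1], v[0]])
    return w / (np.linalg.norm(w) + 1e-12)

def evasion_velocity(p_y, detected, speed=1.0):
    # detected: list of (p_x, v_x) pairs for the detected x-agents
    dirs, weights = [], []
    for p_x, v_x in detected:
        dirs.append(perpendicular(v_x))
        d = np.linalg.norm(np.asarray(p_x, dtype=float) - np.asarray(p_y, dtype=float))
        weights.append(1.0 / (d + 1e-12))            # closer x-agent = larger threat (assumed)
    w = np.asarray(weights) / np.sum(weights)        # eta plays the role of the normalization factor
    v = np.sum(w[:, None] * np.asarray(dirs), axis=0)
    return speed * v / (np.linalg.norm(v) + 1e-12)

print(evasion_velocity([0.0, 0.0], [([2.0, 0.0], [0.0, 1.0]), ([0.0, 5.0], [1.0, 0.0])]))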
s8, designing relative polar coordinates:
evenly dividing the angle of the whole plane of the polar coordinate into 32 parts to obtain an angle space
Ang={0,π/16,2π/16,...,31π/16}
According to the detection distance r of the x-agent and the y-agenta、roAnd the obstacle avoidance distance d of the y-agentoThe distance is divided into 4 shares, each representing a distance state, which is defined as follows:
Figure BDA0002596375220000044
wherein r isa,doThe following relationships are satisfied:
do<do+Δ<ra
Δ is an offset with respect to raIs small, i.e. delta < ra
S9, designing a state space for cooperative driving: driving the y-group with the x-group is a cooperative process, so the state of the learning algorithm is determined by the motion states of the x-agents and of the neighboring y-agents; in order to realize the cooperative driving mode, the state space is designed as the following expression:

si=[θy,θ1,d1,θ2,d2,...,θk,dk];

this is the state expression when k x-agents are involved; θy is the angular deviation of the y-group cluster from the destination, θi and di are respectively the included angle and the distance between the ith x-agent and the y-group, θ ∈ Ang, and d ∈ Dis;
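The discretization of S8 and the state assembly of S9 can be sketched as follows; the concrete interval boundaries used for the four distance states are assumptions consistent with do < do+Δ < ra:

import math

def quantize_angle(theta):
    # map an angle in radians to an index of Ang = {0, pi/16, ..., 31*pi/16}
    return int(round((theta % (2 * math.pi)) / (math.pi / 16))) % 32

def quantize_distance(d, d_o, delta, r_a):
    # map a distance to one of four distance states d1..d4 (interval choice assumed)
    if d <= d_o:
        return 1
    if d <= d_o + delta:
        return 2
    if d <= r_a:
        return 3
    return 4

def build_state(theta_y, x_observations, d_o, delta, r_a):
    # s_i = (theta_y, theta_1, d_1, ..., theta_k, d_k) as a hashable tuple for the Q table
    state = [quantize_angle(theta_y)]
    for theta_i, d_i in x_observations:
        state.append(quantize_angle(theta_i))
        state.append(quantize_distance(d_i, d_o, delta, r_a))
    return tuple(state)

print(build_state(0.4, [(1.0, 2.5), (2.0, 6.0)], d_o=1.0, delta=0.5, r_a=5.0))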
s10, designing a behavior space of the x-agent:
Ai=[1,2,3,...,32,33];
wherein 1 to 32 represent different attack directions, and correspond to the table values in S8 one by one, wherein 1 represents 0, 2 represents pi/16, 3 represents 2 pi/16, …,32 represents 31 pi/16, 33 represents x-agent is static, and the speed is 0;
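Each element of the behavior space Ai maps to a desired attack velocity: indices 1 to 32 select a heading from Ang at a fixed attack speed and index 33 means the x-agent stays still; a small sketch (the fixed speed value is illustrative):

import math

def action_to_velocity(action, attack_speed=1.0):
    # 1..32 correspond one-to-one to the angles {0, pi/16, ..., 31*pi/16}; 33 = stationary
    if action == 33:
        return (0.0, 0.0)
    theta = (action - 1) * math.pi / 16
    return (attack_speed * math.cos(theta), attack_speed * math.sin(theta))

print(action_to_velocity(1))    # heading 0
print(action_to_velocity(9))    # heading pi/2
print(action_to_velocity(33))   # stay still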
s11, designing a reward and penalty mechanism:
Figure BDA0002596375220000051
in the formula dtyIndicating the distance of the y-agent from the destination,yto allow error value, when the distance between the x-agent and the y-agent is less than doThen a negative reward will be obtained, since the aim is to drive the y-group to the destination while maintaining the cluster topology of the y-group, so the distance is less than doIs not expected to occur; when the direction of motion of the y-group is directed to the destination, this is the behavior that is expected to be seen, thus giving a positive reward; when the y-group reaches the destination, the whole process is finished, and a large return is given, wherein the return indicates the correctness of the strategy of the whole process;
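The reward and penalty mechanism can be sketched as a piecewise function; the numeric values below are placeholders chosen only to respect the signs and relative sizes described above, and the argument names are assumptions:

def reward(d_ty, eps_y, min_xy_dist, d_o, heading_to_destination):
    # d_ty: distance of the y-group from the destination; eps_y: allowed error;
    # min_xy_dist: smallest x-agent / y-agent distance this step; d_o: avoidance range;
    # heading_to_destination: True if the y-group's motion points at the destination
    if d_ty <= eps_y:
        return 100.0          # large terminal reward: the whole-process strategy succeeded
    r = 0.0
    if min_xy_dist < d_o:
        r -= 10.0             # penalty: closer than the obstacle avoidance range
    if heading_to_destination:
        r += 1.0              # positive reward: desired driving behaviour
    return r

print(reward(d_ty=0.05, eps_y=0.1, min_xy_dist=2.0, d_o=1.0, heading_to_destination=True))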
s12, a Q-learning algorithm is a value-based algorithm in a reinforcement learning algorithm, wherein Q is Q (S, a), namely, under the S state at a certain moment, action a is taken to obtain the expectation of income, wherein S belongs to S, S represents a state set, a belongs to A, and A represents the action set;
the environment feeds back a corresponding reward (r) according to the Action of the agent, and the main idea of the algorithm is to construct a Q-table by State and Action to store a Q value, and then to select the Action capable of obtaining the maximum benefit according to the Q value. The Q-value table update function is given as follows:
Figure BDA0002596375220000061
where k represents the kth training, α is the learning rate, γ is the discount factor, ai' denotes the next action, si' is the next state, riIs a reported value.
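The Q-value table update above is the standard tabular Q-Learning rule; a minimal sketch using a dictionary keyed by (state, action), with illustrative hyper-parameter values:

from collections import defaultdict

Q = defaultdict(float)             # Q[(state, action)] -> value, defaults to 0
ALPHA, GAMMA = 0.1, 0.9            # learning rate and discount factor (illustrative)
ACTIONS = list(range(1, 34))       # behavior space A_i = [1..33]

def q_update(s, a, r, s_next):
    # Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    best_next = max(Q[(s_next, a2)] for a2 in ACTIONS)
    Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])

def greedy_action(s):
    # select the action with the largest Q value in state s
    return max(ACTIONS, key=lambda a: Q[(s, a)])

q_update(s=(2, 5, 3), a=9, r=1.0, s_next=(2, 6, 3))
print(Q[((2, 5, 3), 9)], greedy_action((2, 5, 3)))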
Preferably, in three-dimensional space the dimensions of the variables pi, vi and ui in step S1 are all 3.
In step S4, ui^α ensures the stability of the internal topology of the cluster, ui^β realizes obstacle avoidance, and ui^γ determines the movement direction of the y-agent; they are determined in the following way: ui^α is built from the potential-gradient terms φα(||pj-pi||σ)·nij over the neighbor set Ni^a, ui^β is the analogous repulsive term built from the detected x-agents (the beta-agents), and the navigation term is

ui^γ = -c1^γ(pi - pr) - c2^γ(vi - vr)

wherein the control coefficients c1^α, c2^α, c1^β, c2^β, c1^γ and c2^γ are set positive constants; pr and vr are the location and direction of the destination, wherein
φα(z)=ρh(z/rα)φ(z-dα)
φ(z) is an action function built from the smooth sigmoid σ1(z), and ρh(z) is a bump function; ρh(z) is introduced into φα(z) to ensure the smoothness of the potential energy function, the potential energy ψα(z) being obtained by integration:

ψα(z) = ∫(dα to z) φα(s) ds

To ensure that the norm is differentiable everywhere, a sigma norm is defined:

||z||σ = (1/ε)[√(1 + ε||z||²) - 1]

differentiating the sigma norm gives

σε(z) = ∇||z||σ = z/√(1 + ε||z||²)

in particular, when ε = 1:

σ1(z) = z/√(1 + ||z||²)
from the definition of the sigma norm, the following parameters are obtained:
rα=||ra||σ
dα=||d||σ
nij=σ(pj-pi)。
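For illustration, the σ-norm machinery and the navigation feedback toward the destination can be written as follows; the ε value and the control coefficients are placeholders, and only the pieces explicit in the text are shown:

import numpy as np

EPS = 0.1   # epsilon of the sigma norm (illustrative value)

def sigma_norm(z):
    # ||z||_sigma = (sqrt(1 + eps*||z||^2) - 1) / eps, differentiable everywhere
    z = np.asarray(z, dtype=float)
    return (np.sqrt(1.0 + EPS * np.dot(z, z)) - 1.0) / EPS

def sigma_grad(z):
    # gradient of the sigma norm, used as n_ij = sigma(p_j - p_i)
    z = np.asarray(z, dtype=float)
    return z / np.sqrt(1.0 + EPS * np.dot(z, z))

def u_gamma(p_i, v_i, p_r, v_r, c1=1.0, c2=1.0):
    # navigation feedback toward the destination (gamma-agent): -c1*(p_i - p_r) - c2*(v_i - v_r)
    return -c1 * (np.asarray(p_i, dtype=float) - np.asarray(p_r, dtype=float)) \
           - c2 * (np.asarray(v_i, dtype=float) - np.asarray(v_r, dtype=float))

p_i, p_j = np.array([0.0, 0.0]), np.array([3.0, 4.0])
print(sigma_norm(p_j - p_i), sigma_grad(p_j - p_i))
print(u_gamma(p_i, [0.0, 0.0], [10.0, 0.0], [0.0, 0.0]))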
the invention has the beneficial effects that: the invention trains and learns the cluster control algorithm by means of the Q-Learning technology, effectively improves the cluster operation efficiency, maximizes benefits and ensures the stability of the cluster.
Drawings
FIG. 1 is a flow chart of a method of the present invention;
FIG. 2 is a schematic diagram illustrating a process of avoiding x-agent by y-agent in the embodiment;
FIG. 3 is a schematic diagram of the division of relative discretized polar coordinates;
FIG. 4 is a flow chart of a simulation process.
Detailed Description
The technical solutions of the present invention are further described in detail below with reference to the accompanying drawings, but the scope of the present invention is not limited to the following.
As shown in fig. 1, a cluster cooperative countermeasure method based on Q-Learning includes the following steps:
s1, describing a dynamic system of an intelligent agent in a cluster as a second-order integral system as follows:
dpi/dt = vi,  dvi/dt = ui,  i = 1, 2, ..., n

wherein pi is the location of the ith agent in the cluster, vi is the speed of the ith agent in the cluster, ui is the acceleration of the ith agent in the cluster and serves as the control input, and n is the total number of agents in the cluster; dpi/dt and dvi/dt denote the derivatives of pi and vi;
S2, when the distance between two agents in the cluster is smaller than the communication distance, the two agents are considered connected and share their position and speed; the neighbor set of the ith agent in the cluster is described as:

Ni^a = {j ∈ V : ||pj - pi|| ≤ r, j ≠ i};

wherein V denotes the set of agents, r denotes the communication distance between agents, and ||·|| is the Euclidean norm;
s3, two clusters with a mutual counter relationship are set, an agent in the first cluster is an x _ agent, an agent in the second cluster is a y _ agent, the stability of the clusters is still kept when the y _ agent avoids the catching process of the x _ agent, and the x _ agent makes an autonomous decision;
let pi^x, vi^x and ui^x respectively denote the position, speed and control input of the ith x_agent; in the same way, let pi^y, vi^y and ui^y respectively denote the position, speed and control input of the ith y_agent;
the motion process of the ith x _ agent is described by the following equation:
dpi^x/dt = vi^x,  dvi^x/dt = ui^x = fe(vi^x, vi^xe),  vi^xe = fQL(si)

wherein dpi^x/dt and dvi^x/dt denote the derivatives of pi^x and vi^x, fQL(·) is the implicit expression of QL, si is the state variable of QL, QL denotes Q-Learning, vi^xe is the desired speed, and fe(·) is a speed control function; vi^xe is also referred to as the desired attack speed; if the magnitude of vi^xe is constant, the attack speed is equivalent to the attack direction; in order to reduce the number of learning states of the x_agent and accelerate the training of the algorithm, the heading direction needs to be discretized;
let us assume that the group of x _ agent is x _ group and the group of y _ agent is y _ group, and during the process of avoiding x _ group, the direction of y _ agent is mainly determined by the attack direction of x _ agent, and the same discrete quantization operation is performed on both the avoidance direction of y _ agent and the attack direction of x _ agent in order to match the generation of Q learning state of x _ agent; the packing algorithm takes the avoidance speed as input to obtain the control input of y _ agent; the process of the ith y _ agent is described as follows:
Figure BDA00025963752200000811
wherein f isa(. -) represents the y-agent avoidance algorithm, input PxAnd VxIs the detected x-agent's position and velocity,
Figure BDA00025963752200000814
indicating the location of the ith y-agent,
Figure BDA00025963752200000812
speed, output quantity of the ith y-agent
Figure BDA00025963752200000813
Is the desired escape speed, fF(.) is an implicit expression of the flooding algorithm;
s4, in the packing algorithm, setting alpha-agent to represent the y _ agent of the intelligent agent, beta-agent to represent the x-agent of the intelligent agent, and gamma-agent to represent the moving destination of the y-agent of the intelligent agent; generated from alpha-agent, beta-agent, gamma-agent, respectively
Figure BDA0002596375220000091
Figure BDA0002596375220000092
The total control force is calculated as follows:
Figure BDA0002596375220000093
Figure BDA0002596375220000094
for ensuring the stability of the internal topology of the cluster,
Figure BDA0002596375220000095
the avoidance of the y-agent is realized,
Figure BDA0002596375220000096
determining the movement direction of the y-agent;
s5, determining an obstacle avoidance mode:
the first detection range r of the x-agent in the y-agent0Within but not within the obstacle avoidance range d of the y-agent0In the method, because the distance is too far, the y-group clusters can complete collective obstacle avoidance without destroying the internal topological structures of the clusters to carry out respective obstacle avoidance, and in the obstacle avoidance mode, the y-agents have the same destination, and then the method goes to step S6;
secondly, the x-agent is in the obstacle avoidance range of the y-agent, and due to the fact that the distance is too short, if a collective obstacle avoidance mode is continuously adopted, the x-agent and the y-agent are likely to collide with each other; therefore, at this moment, the collective obstacle avoidance mode fails, and respective obstacle avoidance modes are adopted, in this mode, because the acting force of the x-agent on each y-agent in the cluster is different, the y-agents do not completely have the same movement direction, the original topological structure can be cracked, at this moment, the obstacle avoidance is carried out according to the formula definition of S4, according to the formula of S4,
Figure BDA0002596375220000097
for ensuring the stability of the topology inside the y-group cluster,
Figure BDA0002596375220000098
the y-agent can avoid the x-group,
Figure BDA0002596375220000099
determining the movement direction of the y-agent, which is the direction vertical to the movement direction of the x-agent;
s6, defining the distance between the y-group and the y-group as follows:
Figure BDA00025963752200000910
wherein
Figure BDA00025963752200000911
For the jth y-agent, min () represents the minimum function; the basic idea of cluster obstacle avoidance is as follows: after the y-group cluster detects the x-agent(ii) a The y-group cluster selects a direction vertical to the movement direction of the x-agent to move according to the movement direction of the x-agent, and the destination is calculated according to the selected movement direction; when the y-group cluster detects a plurality of x-agents, the selected motion direction is the weighted sum of the vectors of the motion directions of the x-agents, the weight value of the vector represents the threat degree of the x-agents and is determined by the distance between the x-agents and the y-agents;
s7, if the y-group cluster only detects one x-agent, the speed is
Figure BDA00025963752200000912
The evasion speed selected by the y-group cluster is
Figure BDA00025963752200000913
And
Figure BDA00025963752200000914
are respectively as
Figure BDA00025963752200000915
And
Figure BDA00025963752200000916
the unit vector of (2) has:
Figure BDA0002596375220000101
if k x-agents are detected, the cluster velocity is given by:
Figure BDA0002596375220000102
wherein, wykThe threat level for x-agent is calculated by the following formula:
Figure BDA0002596375220000103
wherein η is a normalization factor; as shown in FIG. 2, the process of the y-agent evading the x-agent is shown.
S8, designing relative polar coordinates: as shown in FIG. 3, the angle of the whole polar-coordinate plane is evenly divided into 32 parts, giving the angle space

Ang={0,π/16,2π/16,...,31π/16}

according to the detection distances ra and ro of the x-agent and the y-agent and the obstacle avoidance distance do of the y-agent, the distance is divided into 4 segments, each representing one distance state of the set Dis; the distance states are delimited by the thresholds do, do+Δ and ra, which satisfy the following relationship:

do < do+Δ < ra

Δ is an offset that is small with respect to ra, i.e. Δ ≪ ra;
S9, designing a state space for cooperative driving: driving the y-group with the x-group is a cooperative process, so the state of the learning algorithm is determined by the motion states of the x-agents and of the neighboring y-agents; in order to realize the cooperative driving mode, the state space is designed as the following expression:

si=[θy,θ1,d1,θ2,d2,...,θk,dk];

this is the state expression when k x-agents are involved; θy is the angular deviation of the y-group cluster from the destination, θi and di are respectively the included angle and the distance between the ith x-agent and the y-group, θ ∈ Ang, and d ∈ Dis;
s10, designing a behavior space of the x-agent:
Ai=[1,2,3,...,32,33];
wherein 1 to 32 represent different attack directions, and correspond to the table values in S8, wherein 1 represents 0, 2 represents pi/16, 3 represents 2 pi/16, …,32 represents 31 pi/16, and 33 represents x-agent rest;
s11, designing a reward and penalty mechanism:
Figure BDA0002596375220000111
in the formula dtyIndicating the distance of the y-agent from the destination,yto allow error value, when the distance between the x-agent and the y-agent is less than doThere will be a negative reward, since the aim is to drive the y-agent to the destination while maintaining the cluster topology of the y-agent, so the distance is less than doIs not expected to occur; when the direction of movement of the y-agent points to the destination, this is the behavior that is expected to be seen, thus giving a positive reward; when the y-agent reaches the destination, indicating that the whole process is finished, giving a larger return, wherein the return indicates the correctness of the strategy of the whole process;
s12, a Q-learning algorithm is a value-based algorithm in a reinforcement learning algorithm, wherein Q is Q (S, a), namely, under the S state at a certain moment, action a is taken to obtain the expectation of income, wherein S belongs to S, S represents a state set, a belongs to A, and A represents the action set;
the environment feeds back a corresponding reward (r) according to the Action of the agent, and the main idea of the algorithm is to construct a Q-table by State and Action to store a Q value, and then to select the Action capable of obtaining the maximum benefit according to the Q value. The Q-value table update function is given as follows:
Figure BDA0002596375220000121
where k represents the kth training, α is the learning rate, γ is the discount factor, ai' denotes the next action, si' is the next state, riIs a reported value.
In three-dimensional space, the dimensions of the variables pi, vi and ui in step S1 are all 3.
In step S4, ui^α ensures the stability of the internal topology of the cluster, ui^β realizes obstacle avoidance, and ui^γ determines the movement direction of the y-agent; they are determined in the following way: ui^α is built from the potential-gradient terms φα(||pj-pi||σ)·nij over the neighbor set Ni^a, ui^β is the analogous repulsive term built from the detected x-agents (the beta-agents), and the navigation term is

ui^γ = -c1^γ(pi - pr) - c2^γ(vi - vr)

wherein the control coefficients c1^α, c2^α, c1^β, c2^β, c1^γ and c2^γ are set positive constants; pr and vr are the location and direction of the destination, wherein
φα(z)=ρh(z/rα)φ(z-dα)
φ(z) is an action function built from the smooth sigmoid σ1(z), and ρh(z) is a bump function; ρh(z) is introduced into φα(z) to ensure the smoothness of the potential energy function, the potential energy ψα(z) being obtained by integration:

ψα(z) = ∫(dα to z) φα(s) ds

To ensure that the norm is differentiable everywhere, a sigma norm is defined:

||z||σ = (1/ε)[√(1 + ε||z||²) - 1]

differentiating the sigma norm gives

σε(z) = ∇||z||σ = z/√(1 + ε||z||²)

in particular, when ε = 1:

σ1(z) = z/√(1 + ||z||²)
from the definition of the sigma norm, the following parameters are obtained:
rα=||ra||σ
dα=||d||σ
nij=σ(pj-pi)。
in the embodiment of the application, basic parameters of the simulation environment are set as follows: simulating according to the flow shown in FIG. 4, according to the coordinate size, the location of the destination, the number of x-agents and y-agents and the sensing range, creating a state space and a structure body array Q until all states are finished, and storing the training result in Q. In the training process, the value is gradually reduced, so that the traversed Q value is more accurate; the trained Q value is used for simulation, so that the process of mutual confrontation of the two agents can be realized under the environment, and the process is globally optimal.
The foregoing is a preferred embodiment of the present invention, it is to be understood that the invention is not limited to the form disclosed herein, but is not to be construed as excluding other embodiments, and is capable of other combinations, modifications, and environments and is capable of changes within the scope of the inventive concept as expressed herein, commensurate with the above teachings, or the skill or knowledge of the relevant art. And that modifications and variations may be effected by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (3)

1. A cluster cooperative countermeasure method based on Q-Learning is characterized in that: the method comprises the following steps:
s1, describing a dynamic system of an intelligent agent in a cluster as a second-order integral system as follows:
dpi/dt = vi,  dvi/dt = ui,  i = 1, 2, ..., n

wherein pi is the location of the ith agent in the cluster, vi is the speed of the ith agent in the cluster, ui is the acceleration of the ith agent in the cluster and serves as the control input, and n is the total number of agents in the cluster; dpi/dt and dvi/dt denote the derivatives of pi and vi;
S2, when the distance between two agents in the cluster is smaller than the communication distance, the two agents are considered connected and share their position and speed; the neighbor set of the ith agent in the cluster is described as:

Ni^a = {j ∈ V : ||pj - pi|| ≤ r, j ≠ i};

wherein V denotes the set of agents, r denotes the communication distance between agents, and ||·|| is the Euclidean norm;
s3, two clusters with a mutual counter relationship are set, an agent in the first cluster is an x _ agent, an agent in the second cluster is a y _ agent, the stability of the clusters is still kept when the y _ agent avoids the catching process of the x _ agent, and the x _ agent makes an autonomous decision;
let pi^x, vi^x and ui^x respectively denote the position, speed and control input of the ith x_agent; in the same way, let pi^y, vi^y and ui^y respectively denote the position, speed and control input of the ith y_agent;
the motion process of the ith x _ agent is described by the following equation:
dpi^x/dt = vi^x,  dvi^x/dt = ui^x = fe(vi^x, vi^xe),  vi^xe = fQL(si)

wherein dpi^x/dt and dvi^x/dt denote the derivatives of pi^x and vi^x, fQL(·) is the implicit expression of QL, si is the state variable of QL, QL denotes Q-Learning, vi^xe is the desired speed, and fe(·) is a speed control function; vi^xe is also referred to as the desired attack speed; if the magnitude of vi^xe is constant, the attack speed is equivalent to the attack direction; in order to reduce the number of learning states of the x_agent and accelerate the training of the algorithm, the heading direction needs to be discretized;
let us assume that the group of x _ agent is x _ group and the group of y _ agent is y _ group, and during the process of avoiding x _ group, the direction of y _ agent is mainly determined by the attack direction of x _ agent, and the same discrete quantization operation is performed on both the avoidance direction of y _ agent and the attack direction of x _ agent in order to match the generation of Q learning state of x _ agent; the packing algorithm takes the avoidance speed as input to obtain the control input of y _ agent; the process of the ith y _ agent is described as follows:
Figure FDA0002596375210000021
wherein f isa(. -) represents the y-agent avoidance algorithm, input PxAnd VxIs the detected x-agent's position and velocity,
Figure FDA0002596375210000022
indicating the location of the ith y-agent,
Figure FDA0002596375210000023
speed, output quantity of the ith y-agent
Figure FDA0002596375210000024
Is the desired escape speed, fF(.) is an implicit expression of the flooding algorithm;
s4, in the packing algorithm, setting alpha-agent to represent the y _ agent of the intelligent agent, beta-agent to represent the x-agent of the intelligent agent, and gamma-agent to represent the moving destination of the y-agent of the intelligent agent; generated from alpha-agent, beta-agent, gamma-agent, respectively
Figure FDA0002596375210000025
Figure FDA0002596375210000026
The total control force is calculated as follows:
Figure FDA0002596375210000027
Figure FDA0002596375210000028
for ensuring the stability of the internal topology of the cluster,
Figure FDA0002596375210000029
the avoidance of the y-agent is realized,
Figure FDA00025963752100000210
determining the movement direction of the y-agent;
s5, determining an obstacle avoidance mode:
the first detection range r of the x-agent in the y-agent0Within but not within the obstacle avoidance range d of the y-agent0In the method, because the distance is too far, the y-group clusters can complete collective obstacle avoidance without destroying the internal topological structures of the clusters to carry out respective obstacle avoidance, and in the obstacle avoidance mode, the y-agents have the same destination, and then the method goes to step S6;
secondly, the x-agent is in the obstacle avoidance range of the y-agent, and due to the fact that the distance is too short, if a collective obstacle avoidance mode is continuously adopted, the x-agent and the y-agent are likely to collide with each other; therefore, at this moment, the collective obstacle avoidance mode fails, and respective obstacle avoidance modes are adopted, in this mode, because the acting force of the x-agent on each y-agent in the cluster is different, the y-agents do not completely have the same movement direction, the original topological structure can be cracked, at this moment, the obstacle avoidance is carried out according to the formula definition of S4, according to the formula of S4,
Figure FDA00025963752100000211
for ensuring the stability of the topology inside the y-group cluster,
Figure FDA00025963752100000212
the y-agent can avoid the x-group,
Figure FDA00025963752100000213
determining the movement direction of the y-agent, which is the direction vertical to the movement direction of the x-agent;
s6, defining the distance between the x-group and the y-group as follows:
Figure FDA00025963752100000214
wherein
Figure FDA00025963752100000215
For the jth y-agent, min () represents the minimum function; the basic idea of cluster obstacle avoidance is as follows: after the y-group cluster detects the x-group; the y-agent selects a direction vertical to the movement direction of the x-agent to move according to the detected movement direction of the x-agent, and the destination is calculated according to the selected movement direction; when one y-agent detects a plurality of x-agents, the selected motion direction is the weighted sum of the vectors of the motion directions of the plurality of x-agents, the weight value of the motion direction represents the threat degree of the x-group and is determined by the distance between the x-agent and the y-agent;
s7, if the y-group cluster only detects one x-agent, the speed is
Figure FDA0002596375210000031
The evasion speed selected by the y-group cluster is
Figure FDA0002596375210000032
And
Figure FDA0002596375210000033
are respectively as
Figure FDA0002596375210000034
And
Figure FDA0002596375210000035
the unit vector of (2) has:
Figure FDA0002596375210000036
if k x-agents are detected, the cluster velocity is given by:
Figure FDA0002596375210000037
wherein, wykThe threat level for x-agent is calculated by the following formula:
Figure FDA0002596375210000038
wherein η is a normalization factor;
s8, designing relative polar coordinates:
evenly dividing the angle of the whole plane of the polar coordinate into 32 parts to obtain an angle space
Ang={0,π/16,2π/16,...,31π/16}
According to the detection distance r of the x-agent and the y-agenta、roAnd the obstacle avoidance distance d of the y-agentoThe distance is divided into 4 shares, each representing a distance state, which is defined as follows:
Figure FDA0002596375210000041
wherein r isa,doThe following relationships are satisfied:
do<do+Δ<ra
Δ is an offset with respect to raIs small, i.e. delta < ra
S9, designing a state space for cooperative driving: driving the y-group with the x-group is a cooperative process, so the state of the learning algorithm is determined by the motion states of the x-agents and of the neighboring y-agents; in order to realize the cooperative driving mode, the state space is designed as the following expression:

si=[θy,θ1,d1,θ2,d2,...,θk,dk];

this is the state expression when k x-agents are involved; θy is the angular deviation of the y-group cluster from the destination, θi and di are respectively the included angle and the distance between the ith x-agent and the y-group, θ ∈ Ang, and d ∈ Dis;
s10, designing a behavior space of the x-agent:
Ai=[1,2,3,...,32,33];
wherein 1 to 32 represent different attack directions, and correspond to the table values in S8 one by one, wherein 1 represents 0, 2 represents pi/16, 3 represents 2 pi/16, …,32 represents 31 pi/16, 33 represents x-agent is static, and the speed is 0;
s11, designing a reward and penalty mechanism:
Figure FDA0002596375210000042
in the formula dtyIndicating the distance of the y-agent from the destination,yto allow error value, when the distance between the x-agent and the y-agent is less than doThen a negative reward will be obtained, since the aim is to drive the y-group to the destination while maintaining the cluster topology of the y-group, so the distance is less than doIs not expected to occur; when the direction of motion of the y-group is directed to the destination, this is the behavior that is expected to be seen, thus giving a positive reward; when the y-group reaches the destination, the whole process is finished, and a large return is given, wherein the return indicates the correctness of the strategy of the whole process;
s12, a Q-learning algorithm is a value-based algorithm in a reinforcement learning algorithm, wherein Q is Q (S, a), namely, under the S state at a certain moment, action a is taken to obtain the expectation of income, wherein S belongs to S, S represents a state set, a belongs to A, and A represents the action set;
the environment feeds back a corresponding reward (r) according to the Action of the agent, and the main idea of the algorithm is to construct a Q-table by State and Action to store a Q value, and then to select the Action capable of obtaining the maximum benefit according to the Q value. The Q-value table update function is given as follows:
Figure FDA0002596375210000051
where k represents the kth training, α is the learning rate, γ is the discount factor, ai' denotes the next action, si' is the next state, riIs a reported value.
2. The Q-Learning based cluster cooperative countermeasure method of claim 1, wherein: in three-dimensional space, the dimensions of the variables pi, vi and ui in step S1 are all 3.
3. The Q-Learning based cluster cooperative countermeasure method of claim 1, wherein: in step S4, ui^α ensures the stability of the internal topology of the cluster, ui^β realizes obstacle avoidance, and ui^γ determines the movement direction of the y-agent; they are determined in the following way: ui^α is built from the potential-gradient terms φα(||pj-pi||σ)·nij over the neighbor set Ni^a, ui^β is the analogous repulsive term built from the detected x-agents (the beta-agents), and the navigation term is

ui^γ = -c1^γ(pi - pr) - c2^γ(vi - vr)

wherein the control coefficients c1^α, c2^α, c1^β, c2^β, c1^γ and c2^γ are set positive constants; pr and vr are the location and direction of the destination, wherein
φα(z)=ρh(z/rα)φ(z-dα)
φ(z) is an action function built from the smooth sigmoid σ1(z), and ρh(z) is a bump function; ρh(z) is introduced into φα(z) to ensure the smoothness of the potential energy function, the potential energy ψα(z) being obtained by integration:

ψα(z) = ∫(dα to z) φα(s) ds

To ensure that the norm is differentiable everywhere, a sigma norm is defined:

||z||σ = (1/ε)[√(1 + ε||z||²) - 1]

differentiating the sigma norm gives

σε(z) = ∇||z||σ = z/√(1 + ε||z||²)

in particular, when ε = 1:

σ1(z) = z/√(1 + ||z||²)
from the definition of the sigma norm, the following parameters are obtained:
rα=||ra||σ
dα=||d||σ
nij=σ(pj-pi)。
CN202010710580.9A 2020-07-22 2020-07-22 Q-Learning-based cluster cooperative countermeasure method Pending CN111880565A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010710580.9A CN111880565A (en) 2020-07-22 2020-07-22 Q-Learning-based cluster cooperative countermeasure method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010710580.9A CN111880565A (en) 2020-07-22 2020-07-22 Q-Learning-based cluster cooperative countermeasure method

Publications (1)

Publication Number Publication Date
CN111880565A true CN111880565A (en) 2020-11-03

Family

ID=73155229

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010710580.9A Pending CN111880565A (en) 2020-07-22 2020-07-22 Q-Learning-based cluster cooperative countermeasure method

Country Status (1)

Country Link
CN (1) CN111880565A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112947581A (en) * 2021-03-25 2021-06-11 西北工业大学 Multi-unmanned aerial vehicle collaborative air combat maneuver decision method based on multi-agent reinforcement learning
CN113033756A (en) * 2021-03-25 2021-06-25 重庆大学 Multi-agent control method based on target-oriented aggregation strategy
CN113156954A (en) * 2021-04-25 2021-07-23 电子科技大学 Multi-agent cluster obstacle avoidance method based on reinforcement learning
CN113359824A (en) * 2021-05-31 2021-09-07 杭州电子科技大学 Unmanned aerial vehicle cluster control method based on fuzzy model
CN113885576A (en) * 2021-10-29 2022-01-04 南京航空航天大学 Unmanned aerial vehicle formation environment establishment and control method based on deep reinforcement learning
CN114326749A (en) * 2022-01-11 2022-04-12 电子科技大学长三角研究院(衢州) Deep Q-Learning-based cluster area coverage method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
肖剑 (Xiao Jian): "Research on Flocking Swarm Cooperative Control Algorithms Based on Reinforcement Learning", China Master's Theses Full-text Database, Information Science and Technology Series *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112947581A (en) * 2021-03-25 2021-06-11 西北工业大学 Multi-unmanned aerial vehicle collaborative air combat maneuver decision method based on multi-agent reinforcement learning
CN113033756A (en) * 2021-03-25 2021-06-25 重庆大学 Multi-agent control method based on target-oriented aggregation strategy
CN112947581B (en) * 2021-03-25 2022-07-05 西北工业大学 Multi-unmanned aerial vehicle collaborative air combat maneuver decision method based on multi-agent reinforcement learning
CN113033756B (en) * 2021-03-25 2022-09-16 重庆大学 Multi-agent control method based on target-oriented aggregation strategy
CN113156954A (en) * 2021-04-25 2021-07-23 电子科技大学 Multi-agent cluster obstacle avoidance method based on reinforcement learning
CN113359824A (en) * 2021-05-31 2021-09-07 杭州电子科技大学 Unmanned aerial vehicle cluster control method based on fuzzy model
CN113885576A (en) * 2021-10-29 2022-01-04 南京航空航天大学 Unmanned aerial vehicle formation environment establishment and control method based on deep reinforcement learning
CN114326749A (en) * 2022-01-11 2022-04-12 电子科技大学长三角研究院(衢州) Deep Q-Learning-based cluster area coverage method
CN114326749B (en) * 2022-01-11 2023-10-13 电子科技大学长三角研究院(衢州) Deep Q-Learning-based cluster area coverage method

Similar Documents

Publication Publication Date Title
CN111880565A (en) Q-Learning-based cluster cooperative countermeasure method
CN110632931B (en) Mobile robot collision avoidance planning method based on deep reinforcement learning in dynamic environment
CN110703766B (en) Unmanned aerial vehicle path planning method based on transfer learning strategy deep Q network
Zhu et al. Task assignment and path planning of a multi-AUV system based on a Glasius bio-inspired self-organising map algorithm
CN111780777A (en) Unmanned vehicle route planning method based on improved A-star algorithm and deep reinforcement learning
Wang et al. A survey of underwater search for multi-target using Multi-AUV: Task allocation, path planning, and formation control
CN110347181B (en) Energy consumption-based distributed formation control method for unmanned aerial vehicles
CN111176122B (en) Underwater robot parameter self-adaptive backstepping control method based on double BP neural network Q learning technology
CN111240345A (en) Underwater robot trajectory tracking method based on double BP network reinforcement learning framework
Cao et al. Hunting algorithm for multi-auv based on dynamic prediction of target trajectory in 3d underwater environment
CN108919818B (en) Spacecraft attitude orbit collaborative planning method based on chaotic population variation PIO
CN109947131A (en) A kind of underwater multi-robot formation control method based on intensified learning
CN115993781B (en) Network attack resistant unmanned cluster system cooperative control method, terminal and storage medium
CN109540163A (en) A kind of obstacle-avoiding route planning algorithm combined based on differential evolution and fuzzy control
CN113156954A (en) Multi-agent cluster obstacle avoidance method based on reinforcement learning
CN114089776A (en) Unmanned aerial vehicle obstacle avoidance method based on deep reinforcement learning
CN114518770A (en) Unmanned aerial vehicle path planning method integrating potential field and deep reinforcement learning
CN114003059A (en) UAV path planning method based on deep reinforcement learning under kinematic constraint condition
CN113759935B (en) Intelligent group formation mobile control method based on fuzzy logic
Yan et al. Flocking and collision avoidance for a dynamic squad of fixed-wing UAVs using deep reinforcement learning
Nabizadeh et al. A multi-swarm cellular PSO based on clonal selection algorithm in dynamic environments
Chen et al. A multirobot cooperative area coverage search algorithm based on bioinspired neural network in unknown environments
CN112306097A (en) Novel unmanned aerial vehicle path planning method
CN116954258A (en) Hierarchical control method and device for multi-four-rotor unmanned aerial vehicle formation under unknown disturbance
CN115542921A (en) Autonomous path planning method for multiple robots

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20201103