CN111880565A - Q-Learning-based cluster cooperative countermeasure method - Google Patents

Q-Learning-based cluster cooperative countermeasure method

Info

Publication number
CN111880565A
Authority
CN
China
Prior art keywords
agent
cluster
group
distance
speed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010710580.9A
Other languages
Chinese (zh)
Inventor
王刚
肖剑
薛玉玺
黄治宇
田新宇
孙奇
成雷
王钰瑶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Priority to CN202010710580.9A
Publication of CN111880565A
Pending legal-status: Current

Classifications

    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05D SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00 Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/10 Simultaneous control of position or course in three dimensions
    • G05D1/101 Simultaneous control of position or course in three dimensions specially adapted for aircraft
    • G05D1/104 Simultaneous control of position or course in three dimensions specially adapted for aircraft involving a plurality of aircrafts, e.g. formation flying

Landscapes

  • Engineering & Computer Science (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention discloses a Q-Learning-based cluster cooperative countermeasure method, which comprises the following steps: giving the dynamic system of the agents in a cluster; determining the neighbor set of each agent in the cluster; describing the motion process of two clusters that are in a mutually adversarial relationship; calculating the control force in the flocking algorithm; determining the obstacle avoidance modes; selecting an obstacle avoidance mode and defining the distance between the clusters; determining the cluster velocity; designing relative polar coordinates; designing a state space for cooperative driving; designing the behavior space of the cluster; designing a reward and penalty mechanism; and, for the Q-Learning algorithm, giving its Q-value table update function. The invention trains the cluster control algorithm by means of Q-Learning, which effectively improves cluster operation efficiency, maximizes benefit and ensures the stability of the cluster.

Description

Q-Learning-based cluster cooperative countermeasure method
Technical Field
The invention belongs to the field of multi-agent clustering and Q-Learning, and particularly relates to a Q-Learning-based cluster cooperative countermeasure method.
Background
Cluster confrontation is a common phenomenon, for example the predation of other fish by shark groups in the ocean, or of herbivores by carnivores. In recent years, with the rise of artificial intelligence, intelligent control has become a popular research field, and significant progress has been made with intelligent agents such as unmanned aerial vehicles, unmanned vehicles and mobile robots. An agent is an individual with autonomous behavior and perception capabilities; correspondingly, a multi-agent system is a system composed of a plurality of agents that can jointly complete a given task.
Existing cluster technologies fall mainly into two categories: cluster formation control and cluster search control.
Cluster formation control drives each agent to move along a preset route and in a preset formation to complete a set task, while keeping the stability and robustness of the cluster during the process, for example a formation performance by an unmanned aerial vehicle swarm. Cluster search control drives the agents to search an area to be explored so as to maximize the searched area in the shortest time. However, current cluster technologies are still insufficient in terms of operation efficiency and benefit-maximization design.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a cluster cooperative countermeasure method based on Q-Learning, which can effectively improve the cluster operation efficiency, maximize benefits and ensure the cluster stability.
The purpose of the invention is realized by the following technical scheme: a cluster cooperative countermeasure method based on Q-Learning comprises the following steps:
S1, describing the dynamic system of an agent in the cluster as a second-order integrator system:

dpi/dt = vi,  dvi/dt = ui,  i = 1, 2, ..., n

wherein pi is the location of the ith agent in the cluster, vi is the speed of the ith agent in the cluster, ui is the acceleration of the ith agent in the cluster and serves as the control input, and n is the total number of agents in the cluster; dpi/dt and dvi/dt denote the derivatives of pi and vi;
S2, when the distance between two agents in the cluster is smaller than the communication distance, the two agents are considered connected and share their position and speed; the neighbor set of the ith agent in the cluster is described as:

Ni^a = {j ∈ V : ||pj - pi|| ≤ r, j ≠ i};

wherein V denotes the set of agents, r denotes the communication distance between agents, and ||·|| is the Euclidean norm;
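For illustration only (not part of the claimed method), the following Python sketch shows the second-order integrator of S1 and the neighbor-set computation of S2; the class name, the function name and the forward-Euler time step are assumptions introduced for the example:

import numpy as np

class Agent:
    # Second-order integrator: dp/dt = v, dv/dt = u
    def __init__(self, p, v):
        self.p = np.asarray(p, dtype=float)   # position p_i
        self.v = np.asarray(v, dtype=float)   # velocity v_i

    def step(self, u, dt):
        # advance one time step with control input (acceleration) u
        self.v = self.v + dt * np.asarray(u, dtype=float)
        self.p = self.p + dt * self.v

def neighbor_set(P, i, r):
    # N_i^a = {j : ||p_j - p_i|| <= r, j != i} for positions P (one row per agent)
    d = np.linalg.norm(P - P[i], axis=1)
    return [j for j in range(len(P)) if j != i and d[j] <= r]

# usage: one integration step and one neighbor query
a = Agent(p=[0.0, 0.0], v=[1.0, 0.0])
a.step(u=[0.5, 0.0], dt=0.1)
P = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 0.0]])
print(a.p, a.v, neighbor_set(P, 0, r=2.0))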
s3, two clusters with a mutual counter relationship are set, an agent in the first cluster is an x _ agent, an agent in the second cluster is a y _ agent, the stability of the clusters is still kept when the y _ agent avoids the catching process of the x _ agent, and the x _ agent makes an autonomous decision;
let pi^x, vi^x and ui^x respectively denote the position, speed and control input of the ith x_agent; in the same way, let pi^y, vi^y and ui^y respectively denote the position, speed and control input of the ith y_agent;
the motion process of the ith x _ agent is described by the following equation:
dpi^x/dt = vi^x,  dvi^x/dt = ui^x = fe(vi^x, vi^xe),  vi^xe = fQL(si)

wherein dpi^x/dt and dvi^x/dt denote the derivatives of pi^x and vi^x, fQL(·) is the implicit expression of QL, si is the state variable of QL, QL denotes Q-Learning, vi^xe is the desired speed, and fe(·) is a speed control function; vi^xe is also referred to as the desired attack speed; if the magnitude of vi^xe is constant, the attack speed is equivalent to the attack direction; in order to reduce the number of learning states of the x_agent and accelerate the training of the algorithm, the heading direction needs to be discretized;
let us assume that the group of x _ agent is x _ group and the group of y _ agent is y _ group, and during the process of avoiding x _ group, the direction of y _ agent is mainly determined by the attack direction of x _ agent, and the same discrete quantization operation is performed on both the avoidance direction of y _ agent and the attack direction of x _ agent in order to match the generation of Q learning state of x _ agent; the packing algorithm takes the avoidance speed as input to obtain the control input of y _ agent; the process of the ith y _ agent is described as follows:
Figure BDA0002596375220000029
wherein f isa(. -) represents the y-agent avoidance algorithm, input PxAnd VxIs the detected x-agent's position and velocity,
Figure BDA00025963752200000215
indicating the location of the ith y-agent,
Figure BDA00025963752200000210
speed, output quantity of the ith y-agent
Figure BDA00025963752200000211
Is the desired escape speed, fF(.) is an implicit expression of the flooding algorithm;
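The speed control function fe(·) and the avoidance algorithm fa(·) are left implicit above; purely as an assumption, a common concrete choice for fe(·) is a saturated proportional velocity-tracking law that turns the desired attack or escape velocity into an acceleration command, as sketched below (the gain and saturation values are illustrative):

import numpy as np

def velocity_controller(v, v_desired, k=2.0, u_max=5.0):
    # assumed form of f_e: u = k * (v_desired - v), saturated at u_max;
    # the text only states that f_e maps the current and desired velocity to the control input
    u = k * (np.asarray(v_desired, dtype=float) - np.asarray(v, dtype=float))
    n = np.linalg.norm(u)
    return u if n <= u_max else u * (u_max / n)

print(velocity_controller(v=[0.0, 0.0], v_desired=[1.0, 1.0]))   # -> [2. 2.]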
s4, in the packing algorithm, setting alpha-agent to represent the y _ agent of the intelligent agent, beta-agent to represent the x-agent of the intelligent agent, and gamma-agent to represent the moving destination of the y-agent of the intelligent agent; generated from alpha-agent, beta-agent, gamma-agent, respectively
Figure BDA00025963752200000212
Figure BDA0002596375220000031
The total control force is calculated as follows:
Figure BDA0002596375220000032
Figure BDA0002596375220000033
for ensuring the stability of the internal topology of the cluster,
Figure BDA0002596375220000034
the avoidance of the y-agent is realized,
Figure BDA0002596375220000035
determining the movement direction of the y-agent;
s5, determining an obstacle avoidance mode:
the first detection range r of the x-agent in the y-agent0Within but not within the obstacle avoidance range d of the y-agent0In the method, because the distance is too far, the y-group clusters can complete collective obstacle avoidance without destroying the internal topological structures of the clusters to carry out respective obstacle avoidance, and in the obstacle avoidance mode, the y-agents have the same destination, and then the method goes to step S6;
secondly, the x-agent is in the obstacle avoidance range of the y-agent, and due to the fact that the distance is too short, if a collective obstacle avoidance mode is continuously adopted, the x-agent and the y-agent are likely to collide with each other; therefore, at this moment, the collective obstacle avoidance mode fails, and respective obstacle avoidance modes are adopted, in this mode, because the acting force of the x-agent on each y-agent in the cluster is different, the y-agents do not completely have the same movement direction, the original topological structure can be cracked, at this moment, the obstacle avoidance is carried out according to the formula definition of S4, according to the formula of S4,
Figure BDA0002596375220000036
for ensuring the stability of the topology inside the y-group cluster,
Figure BDA0002596375220000037
the y-agent can avoid the x-group,
Figure BDA0002596375220000038
determining the movement direction of the y-agent, which is the direction vertical to the movement direction of the x-agent;
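The choice between the two obstacle avoidance modes of S5 reduces to comparing the x-agent's distance with the thresholds ro and do; a minimal sketch (the function name and the return labels are illustrative):

import numpy as np

def avoidance_mode(p_x, p_y, r_o, d_o):
    # 'collective': x-agent detected but outside the avoidance range (go to S6);
    # 'individual': x-agent inside the avoidance range, topology may break (forces of S4);
    # 'none': x-agent not detected at all
    d = np.linalg.norm(np.asarray(p_x, dtype=float) - np.asarray(p_y, dtype=float))
    if d <= d_o:
        return "individual"
    if d <= r_o:
        return "collective"
    return "none"

print(avoidance_mode([3.0, 0.0], [0.0, 0.0], r_o=5.0, d_o=1.5))   # -> collective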
s6, defining the distance between the x-group and the y-group as follows:
Figure BDA0002596375220000039
wherein
Figure BDA00025963752200000310
For the jth y-agent, min () represents the minimum function; the basic idea of cluster obstacle avoidance is as follows: after the y-group cluster detects the x-group; the y-agent will select the x-agent perpendicular to the x-agent based on the detected direction of movement of the x-agentMoving in the direction of the moving direction, and calculating the destination according to the selected moving direction; when one y-agent detects a plurality of x-agents, the selected motion direction is the weighted sum of the vectors of the motion directions of the plurality of x-agents, the weight value of the motion direction represents the threat degree of the x-group and is determined by the distance between the x-agent and the y-agent;
s7, if the y-group cluster only detects one x-agent, the speed is
Figure BDA00025963752200000311
The evasion speed selected by the y-group cluster is
Figure BDA00025963752200000312
And
Figure BDA00025963752200000313
are respectively as
Figure BDA00025963752200000314
And
Figure BDA00025963752200000315
the unit vector of (2) has:
Figure BDA0002596375220000041
if k x-agents are detected, the cluster velocity is given by:
Figure BDA0002596375220000042
wherein, wykThe threat level for x-agent is calculated by the following formula:
Figure BDA0002596375220000043
wherein η is a normalization factor;
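A sketch of the evasion rule of S6 and S7: each detected x-agent contributes a direction perpendicular to its own motion, and the contributions are combined with distance-dependent weights; the inverse-distance weighting used here is an assumption, since the text only states that the weight is determined by the x-agent/y-agent distance and normalized by η:

import numpy as np

def perpendicular(v):
    # unit vector perpendicular to a 2-D velocity v (rotation by +90 degrees)
    v = np.asarray(v, dtype=float)
    w = np.array([-v[1], v[0]])
    return w / (np.linalg.norm(w) + 1e-12)

def evasion_velocity(p_y, detected, speed=1.0):
    # detected: list of (p_x, v_x) pairs for the detected x-agents
    dirs, weights = [], []
    for p_x, v_x in detected:
        dirs.append(perpendicular(v_x))
        d = np.linalg.norm(np.asarray(p_x, dtype=float) - np.asarray(p_y, dtype=float))
        weights.append(1.0 / (d + 1e-12))            # closer x-agent = larger threat (assumed)
    w = np.asarray(weights) / np.sum(weights)        # eta plays the role of the normalization factor
    v = np.sum(w[:, None] * np.asarray(dirs), axis=0)
    return speed * v / (np.linalg.norm(v) + 1e-12)

print(evasion_velocity([0.0, 0.0], [([2.0, 0.0], [0.0, 1.0]), ([0.0, 5.0], [1.0, 0.0])]))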
s8, designing relative polar coordinates:
evenly dividing the angle of the whole plane of the polar coordinate into 32 parts to obtain an angle space
Ang={0,π/16,2π/16,...,31π/16}
According to the detection distance r of the x-agent and the y-agenta、roAnd the obstacle avoidance distance d of the y-agentoThe distance is divided into 4 shares, each representing a distance state, which is defined as follows:
Figure BDA0002596375220000044
wherein r isa,doThe following relationships are satisfied:
do<do+Δ<ra
Δ is an offset with respect to raIs small, i.e. delta < ra
S9, designing a state space for cooperative driving: driving the y-group with the x-group is a cooperative process, so the state of the learning algorithm is determined by the motion states of the x-agents and of the neighboring y-agents; in order to realize the cooperative driving mode, the state space is designed as the following expression:

si=[θy,θ1,d1,θ2,d2,...,θk,dk];

this is the state expression when k x-agents are involved; θy is the angular deviation of the y-group cluster from the destination, θi and di are respectively the included angle and the distance between the ith x-agent and the y-group, θ ∈ Ang, and d ∈ Dis;
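The discretization of S8 and the state assembly of S9 can be sketched as follows; the concrete interval boundaries used for the four distance states are assumptions consistent with do < do+Δ < ra:

import math

def quantize_angle(theta):
    # map an angle in radians to an index of Ang = {0, pi/16, ..., 31*pi/16}
    return int(round((theta % (2 * math.pi)) / (math.pi / 16))) % 32

def quantize_distance(d, d_o, delta, r_a):
    # map a distance to one of four distance states d1..d4 (interval choice assumed)
    if d <= d_o:
        return 1
    if d <= d_o + delta:
        return 2
    if d <= r_a:
        return 3
    return 4

def build_state(theta_y, x_observations, d_o, delta, r_a):
    # s_i = (theta_y, theta_1, d_1, ..., theta_k, d_k) as a hashable tuple for the Q table
    state = [quantize_angle(theta_y)]
    for theta_i, d_i in x_observations:
        state.append(quantize_angle(theta_i))
        state.append(quantize_distance(d_i, d_o, delta, r_a))
    return tuple(state)

print(build_state(0.4, [(1.0, 2.5), (2.0, 6.0)], d_o=1.0, delta=0.5, r_a=5.0))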
s10, designing a behavior space of the x-agent:
Ai=[1,2,3,...,32,33];
wherein 1 to 32 represent different attack directions, and correspond to the table values in S8 one by one, wherein 1 represents 0, 2 represents pi/16, 3 represents 2 pi/16, …,32 represents 31 pi/16, 33 represents x-agent is static, and the speed is 0;
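Each element of the behavior space Ai maps to a desired attack velocity: indices 1 to 32 select a heading from Ang at a fixed attack speed and index 33 means the x-agent stays still; a small sketch (the fixed speed value is illustrative):

import math

def action_to_velocity(action, attack_speed=1.0):
    # 1..32 correspond one-to-one to the angles {0, pi/16, ..., 31*pi/16}; 33 = stationary
    if action == 33:
        return (0.0, 0.0)
    theta = (action - 1) * math.pi / 16
    return (attack_speed * math.cos(theta), attack_speed * math.sin(theta))

print(action_to_velocity(1))    # heading 0
print(action_to_velocity(9))    # heading pi/2
print(action_to_velocity(33))   # stay still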
s11, designing a reward and penalty mechanism:
Figure BDA0002596375220000051
in the formula dtyIndicating the distance of the y-agent from the destination,yto allow error value, when the distance between the x-agent and the y-agent is less than doThen a negative reward will be obtained, since the aim is to drive the y-group to the destination while maintaining the cluster topology of the y-group, so the distance is less than doIs not expected to occur; when the direction of motion of the y-group is directed to the destination, this is the behavior that is expected to be seen, thus giving a positive reward; when the y-group reaches the destination, the whole process is finished, and a large return is given, wherein the return indicates the correctness of the strategy of the whole process;
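The reward and penalty mechanism can be sketched as a piecewise function; the numeric values below are placeholders chosen only to respect the signs and relative sizes described above, and the argument names are assumptions:

def reward(d_ty, eps_y, min_xy_dist, d_o, heading_to_destination):
    # d_ty: distance of the y-group from the destination; eps_y: allowed error;
    # min_xy_dist: smallest x-agent / y-agent distance this step; d_o: avoidance range;
    # heading_to_destination: True if the y-group's motion points at the destination
    if d_ty <= eps_y:
        return 100.0          # large terminal reward: the whole-process strategy succeeded
    r = 0.0
    if min_xy_dist < d_o:
        r -= 10.0             # penalty: closer than the obstacle avoidance range
    if heading_to_destination:
        r += 1.0              # positive reward: desired driving behaviour
    return r

print(reward(d_ty=0.05, eps_y=0.1, min_xy_dist=2.0, d_o=1.0, heading_to_destination=True))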
s12, a Q-learning algorithm is a value-based algorithm in a reinforcement learning algorithm, wherein Q is Q (S, a), namely, under the S state at a certain moment, action a is taken to obtain the expectation of income, wherein S belongs to S, S represents a state set, a belongs to A, and A represents the action set;
the environment feeds back a corresponding reward (r) according to the Action of the agent, and the main idea of the algorithm is to construct a Q-table by State and Action to store a Q value, and then to select the Action capable of obtaining the maximum benefit according to the Q value. The Q-value table update function is given as follows:
Figure BDA0002596375220000061
where k represents the kth training, α is the learning rate, γ is the discount factor, ai' denotes the next action, si' is the next state, riIs a reported value.
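The Q-value table update above is the standard tabular Q-Learning rule; a minimal sketch using a dictionary keyed by (state, action), with illustrative hyper-parameter values:

from collections import defaultdict

Q = defaultdict(float)             # Q[(state, action)] -> value, defaults to 0
ALPHA, GAMMA = 0.1, 0.9            # learning rate and discount factor (illustrative)
ACTIONS = list(range(1, 34))       # behavior space A_i = [1..33]

def q_update(s, a, r, s_next):
    # Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    best_next = max(Q[(s_next, a2)] for a2 in ACTIONS)
    Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])

def greedy_action(s):
    # select the action with the largest Q value in state s
    return max(ACTIONS, key=lambda a: Q[(s, a)])

q_update(s=(2, 5, 3), a=9, r=1.0, s_next=(2, 6, 3))
print(Q[((2, 5, 3), 9)], greedy_action((2, 5, 3)))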
Preferably, in three-dimensional space the dimensions of the variables pi, vi and ui in step S1 are all 3.
In step S4, ui^α ensures the stability of the internal topology of the cluster, ui^β realizes obstacle avoidance, and ui^γ determines the movement direction of the y-agent; they are determined in the following way: ui^α is built from the potential-gradient terms φα(||pj-pi||σ)·nij over the neighbor set Ni^a, ui^β is the analogous repulsive term built from the detected x-agents (the beta-agents), and the navigation term is

ui^γ = -c1^γ(pi - pr) - c2^γ(vi - vr)

wherein the control coefficients c1^α, c2^α, c1^β, c2^β, c1^γ and c2^γ are set positive constants; pr and vr are the location and direction of the destination, wherein
φα(z)=ρh(z/rα)φ(z-dα)
φ(z) is an action function built from the smooth sigmoid σ1(z), and ρh(z) is a bump function; ρh(z) is introduced into φα(z) to ensure the smoothness of the potential energy function, the potential energy ψα(z) being obtained by integration:

ψα(z) = ∫(dα to z) φα(s) ds

To ensure that the norm is differentiable everywhere, a sigma norm is defined:

||z||σ = (1/ε)[√(1 + ε||z||²) - 1]

differentiating the sigma norm gives

σε(z) = ∇||z||σ = z/√(1 + ε||z||²)

in particular, when ε = 1:

σ1(z) = z/√(1 + ||z||²)
from the definition of the sigma norm, the following parameters are obtained:
rα=||ra||σ
dα=||d||σ
nij=σ(pj-pi)。
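For illustration, the σ-norm machinery and the navigation feedback toward the destination can be written as follows; the ε value and the control coefficients are placeholders, and only the pieces explicit in the text are shown:

import numpy as np

EPS = 0.1   # epsilon of the sigma norm (illustrative value)

def sigma_norm(z):
    # ||z||_sigma = (sqrt(1 + eps*||z||^2) - 1) / eps, differentiable everywhere
    z = np.asarray(z, dtype=float)
    return (np.sqrt(1.0 + EPS * np.dot(z, z)) - 1.0) / EPS

def sigma_grad(z):
    # gradient of the sigma norm, used as n_ij = sigma(p_j - p_i)
    z = np.asarray(z, dtype=float)
    return z / np.sqrt(1.0 + EPS * np.dot(z, z))

def u_gamma(p_i, v_i, p_r, v_r, c1=1.0, c2=1.0):
    # navigation feedback toward the destination (gamma-agent): -c1*(p_i - p_r) - c2*(v_i - v_r)
    return -c1 * (np.asarray(p_i, dtype=float) - np.asarray(p_r, dtype=float)) \
           - c2 * (np.asarray(v_i, dtype=float) - np.asarray(v_r, dtype=float))

p_i, p_j = np.array([0.0, 0.0]), np.array([3.0, 4.0])
print(sigma_norm(p_j - p_i), sigma_grad(p_j - p_i))
print(u_gamma(p_i, [0.0, 0.0], [10.0, 0.0], [0.0, 0.0]))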
the invention has the beneficial effects that: the invention trains and learns the cluster control algorithm by means of the Q-Learning technology, effectively improves the cluster operation efficiency, maximizes benefits and ensures the stability of the cluster.
Drawings
FIG. 1 is a flow chart of a method of the present invention;
FIG. 2 is a schematic diagram illustrating a process of avoiding x-agent by y-agent in the embodiment;
FIG. 3 is a schematic diagram of the division of relative discretized polar coordinates;
FIG. 4 is a flow chart of a simulation process.
Detailed Description
The technical solutions of the present invention are further described in detail below with reference to the accompanying drawings, but the scope of the present invention is not limited to the following.
As shown in fig. 1, a cluster cooperative countermeasure method based on Q-Learning includes the following steps:
s1, describing a dynamic system of an intelligent agent in a cluster as a second-order integral system as follows:
dpi/dt = vi,  dvi/dt = ui,  i = 1, 2, ..., n

wherein pi is the location of the ith agent in the cluster, vi is the speed of the ith agent in the cluster, ui is the acceleration of the ith agent in the cluster and serves as the control input, and n is the total number of agents in the cluster; dpi/dt and dvi/dt denote the derivatives of pi and vi;
S2, when the distance between two agents in the cluster is smaller than the communication distance, the two agents are considered connected and share their position and speed; the neighbor set of the ith agent in the cluster is described as:

Ni^a = {j ∈ V : ||pj - pi|| ≤ r, j ≠ i};

wherein V denotes the set of agents, r denotes the communication distance between agents, and ||·|| is the Euclidean norm;
s3, two clusters with a mutual counter relationship are set, an agent in the first cluster is an x _ agent, an agent in the second cluster is a y _ agent, the stability of the clusters is still kept when the y _ agent avoids the catching process of the x _ agent, and the x _ agent makes an autonomous decision;
let pi^x, vi^x and ui^x respectively denote the position, speed and control input of the ith x_agent; in the same way, let pi^y, vi^y and ui^y respectively denote the position, speed and control input of the ith y_agent;
the motion process of the ith x _ agent is described by the following equation:
dpi^x/dt = vi^x,  dvi^x/dt = ui^x = fe(vi^x, vi^xe),  vi^xe = fQL(si)

wherein dpi^x/dt and dvi^x/dt denote the derivatives of pi^x and vi^x, fQL(·) is the implicit expression of QL, si is the state variable of QL, QL denotes Q-Learning, vi^xe is the desired speed, and fe(·) is a speed control function; vi^xe is also referred to as the desired attack speed; if the magnitude of vi^xe is constant, the attack speed is equivalent to the attack direction; in order to reduce the number of learning states of the x_agent and accelerate the training of the algorithm, the heading direction needs to be discretized;
let us assume that the group of x _ agent is x _ group and the group of y _ agent is y _ group, and during the process of avoiding x _ group, the direction of y _ agent is mainly determined by the attack direction of x _ agent, and the same discrete quantization operation is performed on both the avoidance direction of y _ agent and the attack direction of x _ agent in order to match the generation of Q learning state of x _ agent; the packing algorithm takes the avoidance speed as input to obtain the control input of y _ agent; the process of the ith y _ agent is described as follows:
Figure BDA00025963752200000811
wherein f isa(. -) represents the y-agent avoidance algorithm, input PxAnd VxIs the detected x-agent's position and velocity,
Figure BDA00025963752200000814
indicating the location of the ith y-agent,
Figure BDA00025963752200000812
speed, output quantity of the ith y-agent
Figure BDA00025963752200000813
Is the desired escape speed, fF(.) is an implicit expression of the flooding algorithm;
s4, in the packing algorithm, setting alpha-agent to represent the y _ agent of the intelligent agent, beta-agent to represent the x-agent of the intelligent agent, and gamma-agent to represent the moving destination of the y-agent of the intelligent agent; generated from alpha-agent, beta-agent, gamma-agent, respectively
Figure BDA0002596375220000091
Figure BDA0002596375220000092
The total control force is calculated as follows:
Figure BDA0002596375220000093
Figure BDA0002596375220000094
for ensuring the stability of the internal topology of the cluster,
Figure BDA0002596375220000095
the avoidance of the y-agent is realized,
Figure BDA0002596375220000096
determining the movement direction of the y-agent;
s5, determining an obstacle avoidance mode:
the first detection range r of the x-agent in the y-agent0Within but not within the obstacle avoidance range d of the y-agent0In the method, because the distance is too far, the y-group clusters can complete collective obstacle avoidance without destroying the internal topological structures of the clusters to carry out respective obstacle avoidance, and in the obstacle avoidance mode, the y-agents have the same destination, and then the method goes to step S6;
secondly, the x-agent is in the obstacle avoidance range of the y-agent, and due to the fact that the distance is too short, if a collective obstacle avoidance mode is continuously adopted, the x-agent and the y-agent are likely to collide with each other; therefore, at this moment, the collective obstacle avoidance mode fails, and respective obstacle avoidance modes are adopted, in this mode, because the acting force of the x-agent on each y-agent in the cluster is different, the y-agents do not completely have the same movement direction, the original topological structure can be cracked, at this moment, the obstacle avoidance is carried out according to the formula definition of S4, according to the formula of S4,
Figure BDA0002596375220000097
for ensuring the stability of the topology inside the y-group cluster,
Figure BDA0002596375220000098
the y-agent can avoid the x-group,
Figure BDA0002596375220000099
determining the movement direction of the y-agent, which is the direction vertical to the movement direction of the x-agent;
s6, defining the distance between the y-group and the y-group as follows:
Figure BDA00025963752200000910
wherein
Figure BDA00025963752200000911
For the jth y-agent, min () represents the minimum function; the basic idea of cluster obstacle avoidance is as follows: after the y-group cluster detects the x-agent(ii) a The y-group cluster selects a direction vertical to the movement direction of the x-agent to move according to the movement direction of the x-agent, and the destination is calculated according to the selected movement direction; when the y-group cluster detects a plurality of x-agents, the selected motion direction is the weighted sum of the vectors of the motion directions of the x-agents, the weight value of the vector represents the threat degree of the x-agents and is determined by the distance between the x-agents and the y-agents;
s7, if the y-group cluster only detects one x-agent, the speed is
Figure BDA00025963752200000912
The evasion speed selected by the y-group cluster is
Figure BDA00025963752200000913
And
Figure BDA00025963752200000914
are respectively as
Figure BDA00025963752200000915
And
Figure BDA00025963752200000916
the unit vector of (2) has:
Figure BDA0002596375220000101
if k x-agents are detected, the cluster velocity is given by:
Figure BDA0002596375220000102
wherein, wykThe threat level for x-agent is calculated by the following formula:
Figure BDA0002596375220000103
wherein η is a normalization factor; as shown in FIG. 2, the process of the y-agent evading the x-agent is shown.
S8, designing relative polar coordinates: as shown in FIG. 3, the angle of the whole polar-coordinate plane is evenly divided into 32 parts, giving the angle space

Ang={0,π/16,2π/16,...,31π/16}

according to the detection distances ra and ro of the x-agent and the y-agent and the obstacle avoidance distance do of the y-agent, the distance is divided into 4 segments, each representing one distance state of the set Dis; the distance states are delimited by the thresholds do, do+Δ and ra, which satisfy the following relationship:

do < do+Δ < ra

Δ is an offset that is small with respect to ra, i.e. Δ ≪ ra;
S9, designing a state space for cooperative driving: driving the y-group with the x-group is a cooperative process, so the state of the learning algorithm is determined by the motion states of the x-agents and of the neighboring y-agents; in order to realize the cooperative driving mode, the state space is designed as the following expression:

si=[θy,θ1,d1,θ2,d2,...,θk,dk];

this is the state expression when k x-agents are involved; θy is the angular deviation of the y-group cluster from the destination, θi and di are respectively the included angle and the distance between the ith x-agent and the y-group, θ ∈ Ang, and d ∈ Dis;
s10, designing a behavior space of the x-agent:
Ai=[1,2,3,...,32,33];
wherein 1 to 32 represent different attack directions, and correspond to the table values in S8, wherein 1 represents 0, 2 represents pi/16, 3 represents 2 pi/16, …,32 represents 31 pi/16, and 33 represents x-agent rest;
s11, designing a reward and penalty mechanism:
Figure BDA0002596375220000111
in the formula dtyIndicating the distance of the y-agent from the destination,yto allow error value, when the distance between the x-agent and the y-agent is less than doThere will be a negative reward, since the aim is to drive the y-agent to the destination while maintaining the cluster topology of the y-agent, so the distance is less than doIs not expected to occur; when the direction of movement of the y-agent points to the destination, this is the behavior that is expected to be seen, thus giving a positive reward; when the y-agent reaches the destination, indicating that the whole process is finished, giving a larger return, wherein the return indicates the correctness of the strategy of the whole process;
s12, a Q-learning algorithm is a value-based algorithm in a reinforcement learning algorithm, wherein Q is Q (S, a), namely, under the S state at a certain moment, action a is taken to obtain the expectation of income, wherein S belongs to S, S represents a state set, a belongs to A, and A represents the action set;
the environment feeds back a corresponding reward (r) according to the Action of the agent, and the main idea of the algorithm is to construct a Q-table by State and Action to store a Q value, and then to select the Action capable of obtaining the maximum benefit according to the Q value. The Q-value table update function is given as follows:
Figure BDA0002596375220000121
where k represents the kth training, α is the learning rate, γ is the discount factor, ai' denotes the next action, si' is the next state, riIs a reported value.
In three-dimensional space, the dimensions of the variables pi, vi and ui in step S1 are all 3.
In step S4, ui^α ensures the stability of the internal topology of the cluster, ui^β realizes obstacle avoidance, and ui^γ determines the movement direction of the y-agent; they are determined in the following way: ui^α is built from the potential-gradient terms φα(||pj-pi||σ)·nij over the neighbor set Ni^a, ui^β is the analogous repulsive term built from the detected x-agents (the beta-agents), and the navigation term is

ui^γ = -c1^γ(pi - pr) - c2^γ(vi - vr)

wherein the control coefficients c1^α, c2^α, c1^β, c2^β, c1^γ and c2^γ are set positive constants; pr and vr are the location and direction of the destination, wherein
φα(z)=ρh(z/rα)φ(z-dα)
φ(z) is an action function built from the smooth sigmoid σ1(z), and ρh(z) is a bump function; ρh(z) is introduced into φα(z) to ensure the smoothness of the potential energy function, the potential energy ψα(z) being obtained by integration:

ψα(z) = ∫(dα to z) φα(s) ds

To ensure that the norm is differentiable everywhere, a sigma norm is defined:

||z||σ = (1/ε)[√(1 + ε||z||²) - 1]

differentiating the sigma norm gives

σε(z) = ∇||z||σ = z/√(1 + ε||z||²)

in particular, when ε = 1:

σ1(z) = z/√(1 + ||z||²)
from the definition of the sigma norm, the following parameters are obtained:
rα=||ra||σ
dα=||d||σ
nij=σ(pj-pi)。
in the embodiment of the application, basic parameters of the simulation environment are set as follows: simulating according to the flow shown in FIG. 4, according to the coordinate size, the location of the destination, the number of x-agents and y-agents and the sensing range, creating a state space and a structure body array Q until all states are finished, and storing the training result in Q. In the training process, the value is gradually reduced, so that the traversed Q value is more accurate; the trained Q value is used for simulation, so that the process of mutual confrontation of the two agents can be realized under the environment, and the process is globally optimal.
The foregoing is a preferred embodiment of the present invention, it is to be understood that the invention is not limited to the form disclosed herein, but is not to be construed as excluding other embodiments, and is capable of other combinations, modifications, and environments and is capable of changes within the scope of the inventive concept as expressed herein, commensurate with the above teachings, or the skill or knowledge of the relevant art. And that modifications and variations may be effected by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (3)

1. A cluster cooperative countermeasure method based on Q-Learning is characterized in that: the method comprises the following steps:
s1, describing a dynamic system of an intelligent agent in a cluster as a second-order integral system as follows:
dpi/dt = vi,  dvi/dt = ui,  i = 1, 2, ..., n

wherein pi is the location of the ith agent in the cluster, vi is the speed of the ith agent in the cluster, ui is the acceleration of the ith agent in the cluster and serves as the control input, and n is the total number of agents in the cluster; dpi/dt and dvi/dt denote the derivatives of pi and vi;
S2, when the distance between two agents in the cluster is smaller than the communication distance, the two agents are considered connected and share their position and speed; the neighbor set of the ith agent in the cluster is described as:

Ni^a = {j ∈ V : ||pj - pi|| ≤ r, j ≠ i};

wherein V denotes the set of agents, r denotes the communication distance between agents, and ||·|| is the Euclidean norm;
s3, two clusters with a mutual counter relationship are set, an agent in the first cluster is an x _ agent, an agent in the second cluster is a y _ agent, the stability of the clusters is still kept when the y _ agent avoids the catching process of the x _ agent, and the x _ agent makes an autonomous decision;
let pi^x, vi^x and ui^x respectively denote the position, speed and control input of the ith x_agent; in the same way, let pi^y, vi^y and ui^y respectively denote the position, speed and control input of the ith y_agent;
the motion process of the ith x _ agent is described by the following equation:
dpi^x/dt = vi^x,  dvi^x/dt = ui^x = fe(vi^x, vi^xe),  vi^xe = fQL(si)

wherein dpi^x/dt and dvi^x/dt denote the derivatives of pi^x and vi^x, fQL(·) is the implicit expression of QL, si is the state variable of QL, QL denotes Q-Learning, vi^xe is the desired speed, and fe(·) is a speed control function; vi^xe is also referred to as the desired attack speed; if the magnitude of vi^xe is constant, the attack speed is equivalent to the attack direction; in order to reduce the number of learning states of the x_agent and accelerate the training of the algorithm, the heading direction needs to be discretized;
let us assume that the group of x _ agent is x _ group and the group of y _ agent is y _ group, and during the process of avoiding x _ group, the direction of y _ agent is mainly determined by the attack direction of x _ agent, and the same discrete quantization operation is performed on both the avoidance direction of y _ agent and the attack direction of x _ agent in order to match the generation of Q learning state of x _ agent; the packing algorithm takes the avoidance speed as input to obtain the control input of y _ agent; the process of the ith y _ agent is described as follows:
Figure FDA0002596375210000021
wherein f isa(. -) represents the y-agent avoidance algorithm, input PxAnd VxIs the detected x-agent's position and velocity,
Figure FDA0002596375210000022
indicating the location of the ith y-agent,
Figure FDA0002596375210000023
speed, output quantity of the ith y-agent
Figure FDA0002596375210000024
Is the desired escape speed, fF(.) is an implicit expression of the flooding algorithm;
s4, in the packing algorithm, setting alpha-agent to represent the y _ agent of the intelligent agent, beta-agent to represent the x-agent of the intelligent agent, and gamma-agent to represent the moving destination of the y-agent of the intelligent agent; generated from alpha-agent, beta-agent, gamma-agent, respectively
Figure FDA0002596375210000025
Figure FDA0002596375210000026
The total control force is calculated as follows:
Figure FDA0002596375210000027
Figure FDA0002596375210000028
for ensuring the stability of the internal topology of the cluster,
Figure FDA0002596375210000029
the avoidance of the y-agent is realized,
Figure FDA00025963752100000210
determining the movement direction of the y-agent;
s5, determining an obstacle avoidance mode:
the first detection range r of the x-agent in the y-agent0Within but not within the obstacle avoidance range d of the y-agent0In the method, because the distance is too far, the y-group clusters can complete collective obstacle avoidance without destroying the internal topological structures of the clusters to carry out respective obstacle avoidance, and in the obstacle avoidance mode, the y-agents have the same destination, and then the method goes to step S6;
secondly, the x-agent is in the obstacle avoidance range of the y-agent, and due to the fact that the distance is too short, if a collective obstacle avoidance mode is continuously adopted, the x-agent and the y-agent are likely to collide with each other; therefore, at this moment, the collective obstacle avoidance mode fails, and respective obstacle avoidance modes are adopted, in this mode, because the acting force of the x-agent on each y-agent in the cluster is different, the y-agents do not completely have the same movement direction, the original topological structure can be cracked, at this moment, the obstacle avoidance is carried out according to the formula definition of S4, according to the formula of S4,
Figure FDA00025963752100000211
for ensuring the stability of the topology inside the y-group cluster,
Figure FDA00025963752100000212
the y-agent can avoid the x-group,
Figure FDA00025963752100000213
determining the movement direction of the y-agent, which is the direction vertical to the movement direction of the x-agent;
s6, defining the distance between the x-group and the y-group as follows:
Figure FDA00025963752100000214
wherein
Figure FDA00025963752100000215
For the jth y-agent, min () represents the minimum function; the basic idea of cluster obstacle avoidance is as follows: after the y-group cluster detects the x-group; the y-agent selects a direction vertical to the movement direction of the x-agent to move according to the detected movement direction of the x-agent, and the destination is calculated according to the selected movement direction; when one y-agent detects a plurality of x-agents, the selected motion direction is the weighted sum of the vectors of the motion directions of the plurality of x-agents, the weight value of the motion direction represents the threat degree of the x-group and is determined by the distance between the x-agent and the y-agent;
s7, if the y-group cluster only detects one x-agent, the speed is
Figure FDA0002596375210000031
The evasion speed selected by the y-group cluster is
Figure FDA0002596375210000032
And
Figure FDA0002596375210000033
are respectively as
Figure FDA0002596375210000034
And
Figure FDA0002596375210000035
the unit vector of (2) has:
Figure FDA0002596375210000036
if k x-agents are detected, the cluster velocity is given by:
Figure FDA0002596375210000037
wherein, wykThe threat level for x-agent is calculated by the following formula:
Figure FDA0002596375210000038
wherein η is a normalization factor;
s8, designing relative polar coordinates:
evenly dividing the angle of the whole plane of the polar coordinate into 32 parts to obtain an angle space
Ang={0,π/16,2π/16,...,31π/16}
According to the detection distance r of the x-agent and the y-agenta、roAnd the obstacle avoidance distance d of the y-agentoThe distance is divided into 4 shares, each representing a distance state, which is defined as follows:
Figure FDA0002596375210000041
wherein r isa,doThe following relationships are satisfied:
do<do+Δ<ra
Δ is an offset with respect to raIs small, i.e. delta < ra
S9, designing a state space for cooperative driving: driving the y-group with the x-group is a cooperative process, so the state of the learning algorithm is determined by the motion states of the x-agents and of the neighboring y-agents; in order to realize the cooperative driving mode, the state space is designed as the following expression:

si=[θy,θ1,d1,θ2,d2,...,θk,dk];

this is the state expression when k x-agents are involved; θy is the angular deviation of the y-group cluster from the destination, θi and di are respectively the included angle and the distance between the ith x-agent and the y-group, θ ∈ Ang, and d ∈ Dis;
s10, designing a behavior space of the x-agent:
Ai=[1,2,3,...,32,33];
wherein 1 to 32 represent different attack directions, and correspond to the table values in S8 one by one, wherein 1 represents 0, 2 represents pi/16, 3 represents 2 pi/16, …,32 represents 31 pi/16, 33 represents x-agent is static, and the speed is 0;
s11, designing a reward and penalty mechanism:
Figure FDA0002596375210000042
in the formula dtyIndicating the distance of the y-agent from the destination,yto allow error value, when the distance between the x-agent and the y-agent is less than doThen a negative reward will be obtained, since the aim is to drive the y-group to the destination while maintaining the cluster topology of the y-group, so the distance is less than doIs not expected to occur; when the direction of motion of the y-group is directed to the destination, this is the behavior that is expected to be seen, thus giving a positive reward; when the y-group reaches the destination, the whole process is finished, and a large return is given, wherein the return indicates the correctness of the strategy of the whole process;
s12, a Q-learning algorithm is a value-based algorithm in a reinforcement learning algorithm, wherein Q is Q (S, a), namely, under the S state at a certain moment, action a is taken to obtain the expectation of income, wherein S belongs to S, S represents a state set, a belongs to A, and A represents the action set;
the environment feeds back a corresponding reward (r) according to the Action of the agent, and the main idea of the algorithm is to construct a Q-table by State and Action to store a Q value, and then to select the Action capable of obtaining the maximum benefit according to the Q value. The Q-value table update function is given as follows:
Figure FDA0002596375210000051
where k represents the kth training, α is the learning rate, γ is the discount factor, ai' denotes the next action, si' is the next state, riIs a reported value.
2. The Q-Learning based cluster cooperative countermeasure method of claim 1, wherein: in three-dimensional space, the dimensions of the variables pi, vi and ui in step S1 are all 3.
3. The Q-Learning based cluster cooperative countermeasure method of claim 1, wherein: in step S4, ui^α ensures the stability of the internal topology of the cluster, ui^β realizes obstacle avoidance, and ui^γ determines the movement direction of the y-agent; they are determined in the following way: ui^α is built from the potential-gradient terms φα(||pj-pi||σ)·nij over the neighbor set Ni^a, ui^β is the analogous repulsive term built from the detected x-agents (the beta-agents), and the navigation term is

ui^γ = -c1^γ(pi - pr) - c2^γ(vi - vr)

wherein the control coefficients c1^α, c2^α, c1^β, c2^β, c1^γ and c2^γ are set positive constants; pr and vr are the location and direction of the destination, wherein
φα(z)=ρh(z/rα)φ(z-dα)
φ(z) is an action function built from the smooth sigmoid σ1(z), and ρh(z) is a bump function; ρh(z) is introduced into φα(z) to ensure the smoothness of the potential energy function, the potential energy ψα(z) being obtained by integration:

ψα(z) = ∫(dα to z) φα(s) ds

To ensure that the norm is differentiable everywhere, a sigma norm is defined:

||z||σ = (1/ε)[√(1 + ε||z||²) - 1]

differentiating the sigma norm gives

σε(z) = ∇||z||σ = z/√(1 + ε||z||²)

in particular, when ε = 1:

σ1(z) = z/√(1 + ||z||²)
from the definition of the sigma norm, the following parameters are obtained:
rα=||ra||σ
dα=||d||σ
nij=σ(pj-pi)。
CN202010710580.9A 2020-07-22 2020-07-22 Q-Learning-based cluster cooperative countermeasure method Pending CN111880565A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010710580.9A CN111880565A (en) 2020-07-22 2020-07-22 Q-Learning-based cluster cooperative countermeasure method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010710580.9A CN111880565A (en) 2020-07-22 2020-07-22 Q-Learning-based cluster cooperative countermeasure method

Publications (1)

Publication Number Publication Date
CN111880565A true CN111880565A (en) 2020-11-03

Family

ID=73155229

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010710580.9A Pending CN111880565A (en) 2020-07-22 2020-07-22 Q-Learning-based cluster cooperative countermeasure method

Country Status (1)

Country Link
CN (1) CN111880565A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112947581A (en) * 2021-03-25 2021-06-11 西北工业大学 Multi-unmanned aerial vehicle collaborative air combat maneuver decision method based on multi-agent reinforcement learning
CN113033756A (en) * 2021-03-25 2021-06-25 重庆大学 Multi-agent control method based on target-oriented aggregation strategy
CN113156954A (en) * 2021-04-25 2021-07-23 电子科技大学 Multi-agent cluster obstacle avoidance method based on reinforcement learning
CN113359824A (en) * 2021-05-31 2021-09-07 杭州电子科技大学 Unmanned aerial vehicle cluster control method based on fuzzy model
CN113885576A (en) * 2021-10-29 2022-01-04 南京航空航天大学 Unmanned aerial vehicle formation environment establishment and control method based on deep reinforcement learning
CN114326749A (en) * 2022-01-11 2022-04-12 电子科技大学长三角研究院(衢州) Deep Q-Learning-based cluster area coverage method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
肖剑 (Xiao Jian): "Research on Flocking Swarm Cooperative Control Algorithms Based on Reinforcement Learning", China Master's Theses Full-text Database, Information Science and Technology Series *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112947581A (en) * 2021-03-25 2021-06-11 西北工业大学 Multi-unmanned aerial vehicle collaborative air combat maneuver decision method based on multi-agent reinforcement learning
CN113033756A (en) * 2021-03-25 2021-06-25 重庆大学 Multi-agent control method based on target-oriented aggregation strategy
CN112947581B (en) * 2021-03-25 2022-07-05 西北工业大学 Multi-unmanned aerial vehicle collaborative air combat maneuver decision method based on multi-agent reinforcement learning
CN113033756B (en) * 2021-03-25 2022-09-16 重庆大学 Multi-agent control method based on target-oriented aggregation strategy
CN113156954A (en) * 2021-04-25 2021-07-23 电子科技大学 Multi-agent cluster obstacle avoidance method based on reinforcement learning
CN113359824A (en) * 2021-05-31 2021-09-07 杭州电子科技大学 Unmanned aerial vehicle cluster control method based on fuzzy model
CN113885576A (en) * 2021-10-29 2022-01-04 南京航空航天大学 Unmanned aerial vehicle formation environment establishment and control method based on deep reinforcement learning
CN114326749A (en) * 2022-01-11 2022-04-12 电子科技大学长三角研究院(衢州) Deep Q-Learning-based cluster area coverage method
CN114326749B (en) * 2022-01-11 2023-10-13 电子科技大学长三角研究院(衢州) Deep Q-Learning-based cluster area coverage method

Similar Documents

Publication Publication Date Title
CN111880565A (en) Q-Learning-based cluster cooperative countermeasure method
CN110632931B (en) Mobile robot collision avoidance planning method based on deep reinforcement learning in dynamic environment
CN110703766B (en) Unmanned aerial vehicle path planning method based on transfer learning strategy deep Q network
Zhu et al. Task assignment and path planning of a multi-AUV system based on a Glasius bio-inspired self-organising map algorithm
CN111780777A (en) Unmanned vehicle route planning method based on improved A-star algorithm and deep reinforcement learning
Wang et al. A survey of underwater search for multi-target using Multi-AUV: Task allocation, path planning, and formation control
CN110347181B (en) Energy consumption-based distributed formation control method for unmanned aerial vehicles
CN111176122B (en) Underwater robot parameter self-adaptive backstepping control method based on double BP neural network Q learning technology
CN111240345A (en) Underwater robot trajectory tracking method based on double BP network reinforcement learning framework
Cao et al. Hunting algorithm for multi-auv based on dynamic prediction of target trajectory in 3d underwater environment
CN108919818B (en) Spacecraft attitude orbit collaborative planning method based on chaotic population variation PIO
CN109947131A (en) A kind of underwater multi-robot formation control method based on intensified learning
CN115993781B (en) Network attack resistant unmanned cluster system cooperative control method, terminal and storage medium
CN109540163A (en) A kind of obstacle-avoiding route planning algorithm combined based on differential evolution and fuzzy control
CN113156954A (en) Multi-agent cluster obstacle avoidance method based on reinforcement learning
CN114089776A (en) Unmanned aerial vehicle obstacle avoidance method based on deep reinforcement learning
CN114518770A (en) Unmanned aerial vehicle path planning method integrating potential field and deep reinforcement learning
CN114003059A (en) UAV path planning method based on deep reinforcement learning under kinematic constraint condition
CN113759935B (en) Intelligent group formation mobile control method based on fuzzy logic
Yan et al. Flocking and collision avoidance for a dynamic squad of fixed-wing UAVs using deep reinforcement learning
Nabizadeh et al. A multi-swarm cellular PSO based on clonal selection algorithm in dynamic environments
Chen et al. A multirobot cooperative area coverage search algorithm based on bioinspired neural network in unknown environments
CN112306097A (en) Novel unmanned aerial vehicle path planning method
CN116954258A (en) Hierarchical control method and device for multi-four-rotor unmanned aerial vehicle formation under unknown disturbance
CN115542921A (en) Autonomous path planning method for multiple robots

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20201103