CN111880564A - Multi-agent area searching method based on collaborative reinforcement learning - Google Patents

Multi-agent area searching method based on collaborative reinforcement learning

Info

Publication number
CN111880564A
Authority
CN
China
Prior art keywords
agent
cluster
gamma
reinforcement learning
value table
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010710554.6A
Other languages
Chinese (zh)
Inventor
张瑛
肖剑
黄治宇
薛玉玺
吴磊
靳一丹
吴冰航
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN202010710554.6A priority Critical patent/CN111880564A/en
Publication of CN111880564A publication Critical patent/CN111880564A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G05: CONTROLLING; REGULATING
    • G05D: SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00: Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/10: Simultaneous control of position or course in three dimensions
    • G05D1/101: Simultaneous control of position or course in three dimensions specially adapted for aircraft
    • G05D1/104: Simultaneous control of position or course in three dimensions specially adapted for aircraft involving a plurality of aircrafts, e.g. formation flying


Abstract

The invention discloses a multi-agent area searching method based on collaborative reinforcement learning, which comprises the following steps: S1, establishing a motion model of the cluster system; S2, defining a fusion mode for the γ information maps and the cluster information map; S3, defining the state space and behavior space required for reinforcement learning training; S4, defining an interactive reinforcement learning training method based on the state space and behavior space; and S5, acquiring the trained Q-value table, carrying out the area search according to the motion model, and determining the position at the next moment according to the Q-value table. The invention enables each agent to share its neighbors' learning experience; during sharing, useless experience is filtered out by screening, which improves learning efficiency while greatly reducing the communication traffic between agents.

Description

Multi-agent area searching method based on collaborative reinforcement learning
Technical Field
The invention relates to multi-agent area search, in particular to a multi-agent area search method based on collaborative reinforcement learning.
Background
Clustering is a very common phenomenon in nature. With the rise of artificial intelligence in recent years, intelligent control has become a popular research field, and great progress has been made on agents such as unmanned aerial vehicles, unmanned vehicles, and mobile robots. The gradual maturity of single-agent technology is pushing intelligent systems toward clusters, and flocking cluster control algorithms are widely applied to tasks such as unmanned aerial vehicle search, reconnaissance, and strike in the face of increasingly complex combat environments and multi-task requirements.
Q-learning is a typical reinforcement learning algorithm that converts learned experience into a Q-table, from which the best strategy can be selected. During cluster traversal, the γ points of the multi-agent search system are planned through Q-learning; once the Q-learning algorithm has finished learning, the optimal γ-point planning strategy is obtained, so that rapid traversal of the target area is achieved.
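To make the role of the Q-value table concrete, the following is a minimal tabular Q-learning sketch in Python; the grid size, learning rate, discount factor and exploration rate are illustrative assumptions rather than values taken from this patent.

```python
import numpy as np

# Minimal tabular Q-learning sketch; all parameter values are illustrative assumptions.
n_states, n_actions = 100, 9          # e.g. a 10 x 10 grid of gamma points, 9 candidate moves
alpha, eta, epsilon = 0.1, 0.9, 0.1   # learning rate, discount factor, exploration rate
Q = np.zeros((n_states, n_actions))   # the Q-value table

def choose_action(state, rng):
    """Epsilon-greedy action selection from the Q-table."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[state]))

def q_update(s, a, r, s_next):
    """One Q-learning step: move Q(s, a) toward r + eta * max_a' Q(s', a')."""
    Q[s, a] += alpha * (r + eta * np.max(Q[s_next]) - Q[s, a])
```

After training, the greedy policy is simply the argmax over each row of Q, which is the "best strategy" referred to above.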
Because the traditional Q-learning algorithm is an independent learning method, an agent does not draw on the historical experience of its neighbors during learning. As a result, a multi-agent system may repeatedly learn the same behavior in the same state, which greatly reduces the learning efficiency of the system.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a multi-agent area searching method based on collaborative reinforcement learning, in which the learning experience of neighbors is shared and useless experience is filtered out by screening during sharing, so that learning efficiency is improved while the communication traffic among agents is greatly reduced.
The purpose of the invention is realized by the following technical scheme: a multi-agent area searching method based on collaborative reinforcement learning comprises the following steps:
s1, establishing a motion model of a cluster system;
s2, defining a fusion mode of a gamma information map and a cluster information map;
s3, defining a state space and a behavior space required by reinforcement learning training;
s4, defining an interactive reinforcement learning training method according to the state space and the behavior space;
and S5, acquiring a Q value table obtained by training, carrying out region search according to the motion model, and determining the position of the next moment according to the Q value table.
Further, the step S1 includes the following sub-steps:
Based on a flocking cluster control algorithm, assume that the cluster V contains p agents, V = {1, 2, ..., p}; the ith agent in the cluster is denoted agent i, and its kinematic model is expressed by the following equations:

dp_i/dt = v_i,  dv_i/dt = u_i

where p_i is the position of agent i, v_i is its velocity, and u_i is its acceleration, which is the control input of the cluster agent.

During the search process, the control input of each cluster agent is expressed as:

u_i = u_i^α + u_i^γ

where u_i^α is the collision-avoidance control term of the cluster agent and u_i^γ is the control term that moves the cluster agent toward its desired position.

The collision-avoidance term u_i^α and the potential-field force between agent i and agent j are defined by expressions that appear only as equation images in the original. In those expressions, the coefficients c are positive constants, z is the input quantity, p_i is the position of cluster agent i, d_α = ||d||_σ, r_α is the communication distance between cluster agents, σ_1, a, b and c are user-defined parameters, and h and l are constants. This construction guarantees the smoothness of the potential-field function; to guarantee differentiability, the σ-norm is defined as:

||z||_σ = (1/ε)[√(1 + ε||z||²) − 1]

where ε is a user-defined parameter.

The control term that moves the cluster agent toward the desired position is a PD law (given as an equation image in the original) combining a proportional term on the position error p_i − p_γ and a derivative term on the velocity v_i, where c_1^γ and c_2^γ are the proportional and derivative control parameters of the PID controller, v_i is the velocity of agent i, and p_γ is the expected position of agent i at the next moment.
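To illustrate how such a second-order motion model can be stepped forward in simulation, the sketch below combines a simplified repulsive collision-avoidance term with a PD term toward the desired γ point and applies Euler integration; the repulsion law, gains and step size are placeholder assumptions, not the exact potential-field expressions of the patent (which are given only as images).

```python
import numpy as np

dt = 0.1                        # integration step (assumed)
c1_gamma, c2_gamma = 1.0, 0.8   # PD gains toward the desired gamma point (assumed values)

def sigma_norm(z, eps=0.1):
    """Sigma-norm: a differentiable surrogate for the Euclidean norm."""
    return (np.sqrt(1.0 + eps * np.dot(z, z)) - 1.0) / eps

def u_alpha(p_i, neighbor_positions, r_alpha=3.0, k_rep=1.5):
    """Simplified collision avoidance: repel from neighbors closer than r_alpha.
    This stands in for the potential-field force of the flocking algorithm."""
    u = np.zeros_like(p_i)
    for p_j in neighbor_positions:
        d = p_i - p_j
        dist = np.linalg.norm(d)
        if 1e-6 < dist < r_alpha:
            u += k_rep * (1.0 / dist - 1.0 / r_alpha) * d / dist
    return u

def u_gamma(p_i, v_i, p_goal):
    """PD term driving the agent toward its expected gamma point p_goal."""
    return -c1_gamma * (p_i - p_goal) - c2_gamma * v_i

def step(p_i, v_i, p_goal, neighbor_positions):
    """Double-integrator update: u_i = u_i_alpha + u_i_gamma, then Euler integration."""
    u_i = u_alpha(p_i, neighbor_positions) + u_gamma(p_i, v_i, p_goal)
    v_next = v_i + dt * u_i
    p_next = p_i + dt * v_next
    return p_next, v_next
```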
Further, the step S2 includes the following sub-steps:
Assume the traversal area is a rectangular area of size m × n. The area to be searched is quantized into a γ-information map consisting of a k × l matrix, each cell of which corresponds to a γ point, so that a complete search of the area is converted into a complete traversal of the γ points of the information map. The γ points form the γ-information map set of agent i:

m_i(γ) = {γ_{x,y}},  x = 1, 2, ..., k,  y = 1, 2, ..., l;

where k and l are obtained from the area dimensions and the perception radius by a relation given as an equation image in the original, r_s being a user-defined parameter that denotes the perception radius of agent i.

The γ-information maps of all agents in the cluster, {m_1(γ_{x,y}), m_2(γ_{x,y}), ..., m_p(γ_{x,y})}, are obtained; if agent i has traversed a γ point, the information of that point is m_i(γ_{x,y}) = 1, otherwise m_i(γ_{x,y}) = 0. Agent 1, agent 2, ..., agent p establish communication, and each agent fuses its own γ-information map with the γ-information maps of its neighbors according to a fusion formula given as an equation image in the original, where m_i(γ_{x,y}) is the γ-information map of agent i, m_s(γ_{x,y}) is the fused γ-information map of the whole cluster, and V is the set of cluster agents.
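Because the γ-information maps are binary coverage maps, fusing an agent's map with its neighbors' maps amounts to an element-wise OR; the sketch below illustrates this under the assumption that each map is a k × l array of 0/1 flags (the patent's fusion formula itself is given only as an image).

```python
import numpy as np

def fuse_gamma_maps(own_map, neighbor_maps):
    """Fuse an agent's gamma-information map with the maps received from its neighbors.
    A cell is marked covered (1) if any agent within communication range has visited it."""
    fused = own_map.copy()
    for m in neighbor_maps:
        fused = np.maximum(fused, m)   # element-wise OR on 0/1 maps
    return fused

# Example: a 4 x 5 quantized search area, two agents with partial coverage.
m1 = np.zeros((4, 5), dtype=int); m1[0, :2] = 1
m2 = np.zeros((4, 5), dtype=int); m2[1, 3:] = 1
print(fuse_gamma_maps(m1, [m2]))   # cells covered by either agent are 1
```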
Further, the step S3 includes:
Acquire the state space of each cluster agent; the state of agent i is defined by an expression given as an equation image in the original, in which M_i(γ) is the γ-information-map coverage of node i and γ^i_{x,y} is the γ-map location of node i at the next moment.

Acquire the behavior space of each agent in the cluster; a behavior is the selection of a γ point. When node i is in a state S_i, the selectable γ points are the current γ-map location and the 8 cells surrounding it. Denoting these 9 locations by 1 to 9, the node behavior space is defined as the following equation:

A_i = {1, 2, 3, 4, 5, 6, 7, 8, 9}.
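Since the behavior space consists of the current γ cell plus its 8 surrounding cells, each action index can be mapped to a grid offset. The particular assignment of indices 1 to 9 in the sketch below is an assumption for illustration; the patent does not fix the numbering.

```python
# Assumed mapping of action indices 1..9 to (dx, dy) offsets on the gamma grid:
# 1..8 are the surrounding cells, 9 keeps the current cell.
ACTION_OFFSETS = {
    1: (-1, -1), 2: (0, -1), 3: (1, -1),
    4: (-1,  0), 5: (1,  0),
    6: (-1,  1), 7: (0,  1), 8: (1,  1),
    9: (0,  0),
}

def next_gamma_point(x, y, action, k, l):
    """Return the gamma-grid cell selected by an action, clipped to the k x l map."""
    dx, dy = ACTION_OFFSETS[action]
    return min(max(x + dx, 0), k - 1), min(max(y + dy, 0), l - 1)
```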
further, the step S4 includes:
According to the agent's state and behavior, the Q-value table update function of the typical Q-learning algorithm is:

Q^{k+1}(s_i, a_i) = Q^k(s_i, a_i) + α[ r_i + η·max_{a_i'} Q^k(s_i', a_i') − Q^k(s_i, a_i) ]

where k denotes the kth training iteration, α is the learning rate, η is the discount factor, a_i' denotes the next action, and s_i' denotes the next state.

In order to reduce the computational complexity and communication traffic of the learning algorithm and to accelerate its convergence, an agent in the cluster obtains the Q values of the other agents it is connected to, and only the state-action entries with larger Q values in its neighbors' Q-value tables are used as references when updating its own Q values. The Q-value table of the ith agent at the (k+1)th iteration is then updated by a weighted fusion rule whose expressions appear as equation images in the original, where Q_j^k(s_i, a_i) is the Q value of the jth agent, the neighborhood of agent i and the weights w_j are defined by further expressions given as equation images, q_i denotes the position of the ith cluster agent, r_a is a constant denoting the adjacency radius, and h_r(·) is a threshold function, likewise defined by an expression given as an equation image in the original.

r_i is the return function, defined by an expression (given as an equation image in the original) in which γ_{x,y}' is the next γ point obtained by performing action a_i, 0 < c_r < 1 is a constant, k_r is the number of repeated traversals, and T is the time consumed in traversing the γ-information map or covering the dynamic area; r(T) is defined by a further expression (given as an equation image) whose coefficients are constants, r_ref is a constant that serves as the standard return value for the entire traversal process, and T_min is the minimum traversal time under ideal conditions, calculated from a formula (given as an equation image) in which m and n are the dimensions of the search area, k and l are the dimensions of the corresponding information map, and v_max is a constant denoting the maximum speed.

Each agent selects the action with the maximum weighted value according to the current environment and updates its Q-value table. Compared with the traditional reinforcement learning method, this cooperative reinforcement learning method improves efficiency: the interaction of Q-value tables among the agents in the cluster optimizes the training process and shortens the training time.
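A minimal sketch of the cooperative update idea follows: after its own Q-learning step, an agent blends in neighbor Q-values only for entries where a neighbor's estimate exceeds its own, weighting closer neighbors more heavily. The screening rule, weight function and radius are simplified stand-ins for the patent's w_j and h_r(·), which appear only as equation images.

```python
import numpy as np

def local_q_update(Q, s, a, r, s_next, alpha=0.1, eta=0.9):
    """Standard single-agent Q-learning step on the agent's own table."""
    Q[s, a] += alpha * (r + eta * np.max(Q[s_next]) - Q[s, a])

def distance_weight(q_i, q_j, r_a=5.0):
    """Illustrative distance-based weight: closer neighbors get more influence, zero beyond r_a."""
    d = np.linalg.norm(np.asarray(q_i) - np.asarray(q_j))
    return max(0.0, 1.0 - d / r_a)

def cooperative_q_fusion(Q_own, neighbor_Qs, neighbor_weights):
    """Blend in neighbors' Q-values, but only where a neighbor's estimate is larger
    than the agent's own (a simple screening of 'useful' shared experience)."""
    Q_new = Q_own.copy()
    for Q_j, w_j in zip(neighbor_Qs, neighbor_weights):
        better = Q_j > Q_new                 # screen: keep only larger neighbor estimates
        Q_new[better] = (1.0 - w_j) * Q_new[better] + w_j * Q_j[better]
    return Q_new
```

This mirrors the idea, described above, of referencing only the larger Q values in the neighbors' tables; the exact weighting and thresholding of the patent are not reproduced here.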
Further, the step S5 includes:
Steps S1-S4 are repeated to iteratively update the Q-value tables of the agents in the cluster until the Q-value tables converge. The cluster area-search algorithm trained by collaborative reinforcement learning generates, from the Q-value table, the behavior of the whole cluster at the next moment while the cluster carries out its search task. After the collaborative reinforcement learning training, each agent selects the best behavior at the next moment according to the Q-value table so as to maximize the efficiency of the area search, expressed as follows:

p_r = a_i' = arg max Q_i(s_i, a_i)

where s_i denotes the state of the agent at the current moment, a_i denotes the behavior selected by the agent at the current moment, and a_i' denotes the optimal search behavior selected for the next moment.

According to a_i', the optimal desired position p_r in S2 is obtained, and from p_r the velocity and position of the agent are calculated (the update equations are given as equation images in the original); by iteratively querying the Q-value table, the cluster achieves a global search of the area.
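Once training has converged, the on-line search loop reduces to greedily reading the Q-value table and converting the chosen action into the next expected γ point, which becomes the desired position p_γ fed to the S1 motion model. The sketch below illustrates this loop, reusing the same assumed 1-to-9 action numbering as in the earlier sketch.

```python
import numpy as np

# Same assumed 1..9 action numbering as before: 1..8 are surrounding cells, 9 stays in place.
ACTION_OFFSETS = {1: (-1, -1), 2: (0, -1), 3: (1, -1), 4: (-1, 0),
                  5: (1, 0), 6: (-1, 1), 7: (0, 1), 8: (1, 1), 9: (0, 0)}

def select_best_action(Q, state):
    """Greedy policy after training: a_i' = argmax_a Q_i(s_i, a), returned as an index in 1..9."""
    return int(np.argmax(Q[state])) + 1

def search_step(Q, state, cell_xy, grid_shape, cell_size):
    """One on-line search step: pick the best action, map it to the next gamma cell,
    and return that cell's center as the desired position for the S1 controller."""
    a = select_best_action(Q, state)
    dx, dy = ACTION_OFFSETS[a]
    k, l = grid_shape
    x = min(max(cell_xy[0] + dx, 0), k - 1)
    y = min(max(cell_xy[1] + dy, 0), l - 1)
    p_gamma = np.array([(x + 0.5) * cell_size, (y + 0.5) * cell_size])
    return a, (x, y), p_gamma
```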
the invention has the beneficial effects that: the invention provides a cooperative Q-learning algorithm based on the traditional Q-learning, realizes the sharing of the learning experience of the neighbor, filters out useless experience by a screening mode in the sharing process, greatly reduces the communication traffic between intelligent agents while improving the learning efficiency, and improves the cluster searching efficiency and effect.
Drawings
FIG. 1 is a flow chart of a method of the present invention;
Detailed Description
The technical solutions of the present invention are further described in detail below with reference to the accompanying drawings, but the scope of the present invention is not limited to the following.
As shown in fig. 1, a multi-agent area search method based on cooperative reinforcement learning includes the following steps:
s1, establishing a motion model of a cluster system;
s2, defining a fusion mode of a gamma information map and a cluster information map;
s3, defining a state space and a behavior space required by reinforcement learning training;
s4, defining an interactive reinforcement learning training method according to the state space and the behavior space;
and S5, acquiring a Q value table obtained by training, carrying out region search according to the motion model, and determining the position of the next moment according to the Q value table.
Further, the step S1 includes the following sub-steps:
Based on a flocking cluster control algorithm, assume that the cluster V contains p agents, V = {1, 2, ..., p}; the ith agent in the cluster is denoted agent i, and its kinematic model is expressed by the following equations:

dp_i/dt = v_i,  dv_i/dt = u_i

where p_i is the position of agent i, v_i is its velocity, and u_i is its acceleration, which is the control input of the cluster agent.

During the search process, the control input of each cluster agent is expressed as:

u_i = u_i^α + u_i^γ

where u_i^α is the collision-avoidance control term of the cluster agent and u_i^γ is the control term that moves the cluster agent toward its desired position.

The collision-avoidance term u_i^α and the potential-field force between agent i and agent j are defined by expressions that appear only as equation images in the original. In those expressions, the coefficients c are positive constants, z is the input quantity, p_i is the position of cluster agent i, d_α = ||d||_σ, r_α is the communication distance between cluster agents, σ_1, a, b and c are user-defined parameters, and h and l are constants. This construction guarantees the smoothness of the potential-field function; to guarantee differentiability, the σ-norm is defined as:

||z||_σ = (1/ε)[√(1 + ε||z||²) − 1]

where ε is a user-defined parameter.

The control term that moves the cluster agent toward the desired position is a PD law (given as an equation image in the original) combining a proportional term on the position error p_i − p_γ and a derivative term on the velocity v_i, where c_1^γ and c_2^γ are the proportional and derivative control parameters of the PID controller, v_i is the velocity of agent i, and p_γ is the expected position of agent i at the next moment.
Further, the step S2 includes the following sub-steps:
Assume the traversal area is a rectangular area of size m × n. The area to be searched is quantized into a γ-information map consisting of a k × l matrix, each cell of which corresponds to a γ point, so that a complete search of the area is converted into a complete traversal of the γ points of the information map. The γ points form the γ-information map set of agent i:

m_i(γ) = {γ_{x,y}},  x = 1, 2, ..., k,  y = 1, 2, ..., l;

where k and l are obtained from the area dimensions and the perception radius by a relation given as an equation image in the original, r_s being a user-defined parameter that denotes the perception radius of agent i.

The γ-information maps of all agents in the cluster, {m_1(γ_{x,y}), m_2(γ_{x,y}), ..., m_p(γ_{x,y})}, are obtained; if agent i has traversed a γ point, the information of that point is m_i(γ_{x,y}) = 1, otherwise m_i(γ_{x,y}) = 0. Agent 1, agent 2, ..., agent p establish communication, and each agent fuses its own γ-information map with the γ-information maps of its neighbors according to a fusion formula given as an equation image in the original, where m_i(γ_{x,y}) is the γ-information map of agent i, m_s(γ_{x,y}) is the fused γ-information map of the whole cluster, and V is the set of cluster agents.
Further, the step S3 includes:
Acquire the state space of each cluster agent; the state of agent i is defined by an expression given as an equation image in the original, in which M_i(γ) is the γ-information-map coverage of node i and γ^i_{x,y} is the γ-map location of node i at the next moment.

Acquire the behavior space of each agent in the cluster; a behavior is the selection of a γ point. When node i is in a state S_i, the selectable γ points are the current γ-map location and the 8 cells surrounding it. Denoting these 9 locations by 1 to 9, the node behavior space is defined as the following equation:

A_i = {1, 2, 3, 4, 5, 6, 7, 8, 9}.
further, the step S4 includes:
According to the agent's state and behavior, the Q-value table update function of the typical Q-learning algorithm is:

Q^{k+1}(s_i, a_i) = Q^k(s_i, a_i) + α[ r_i + η·max_{a_i'} Q^k(s_i', a_i') − Q^k(s_i, a_i) ]

where k denotes the kth training iteration, α is the learning rate, η is the discount factor, a_i' denotes the next action, and s_i' denotes the next state.

In order to reduce the computational complexity and communication traffic of the learning algorithm and to accelerate its convergence, an agent in the cluster obtains the Q values of the other agents it is connected to, and only the state-action entries with larger Q values in its neighbors' Q-value tables are used as references when updating its own Q values. The Q-value table of the ith agent at the (k+1)th iteration is then updated by a weighted fusion rule whose expressions appear as equation images in the original, where Q_j^k(s_i, a_i) is the Q value of the jth agent, the neighborhood of agent i and the weights w_j are defined by further expressions given as equation images, q_i denotes the position of the ith cluster agent, r_a is a constant denoting the adjacency radius, and h_r(·) is a threshold function, likewise defined by an expression given as an equation image in the original.

r_i is the return function, defined by an expression (given as an equation image in the original) in which γ_{x,y}' is the next γ point obtained by performing action a_i, 0 < c_r < 1 is a constant, k_r is the number of repeated traversals, and T is the time consumed in traversing the γ-information map or covering the dynamic area; r(T) is defined by a further expression (given as an equation image) whose coefficients are constants, r_ref is a constant that serves as the standard return value for the entire traversal process, and T_min is the minimum traversal time under ideal conditions, calculated from a formula (given as an equation image) in which m and n are the dimensions of the search area, k and l are the dimensions of the corresponding information map, and v_max is a constant denoting the maximum speed.

Each agent selects the action with the maximum weighted value according to the current environment and updates its Q-value table. Compared with the traditional reinforcement learning method, this cooperative reinforcement learning method improves efficiency: the interaction of Q-value tables among the agents in the cluster optimizes the training process and shortens the training time.
Further, the step S5 includes:
Steps S1-S4 are repeated to iteratively update the Q-value tables of the agents in the cluster until the Q-value tables converge. The cluster area-search algorithm trained by collaborative reinforcement learning generates, from the Q-value table, the behavior of the whole cluster at the next moment while the cluster carries out its search task. After the collaborative reinforcement learning training, each agent selects the best behavior at the next moment according to the Q-value table so as to maximize the efficiency of the area search, expressed as follows:

p_r = a_i' = arg max Q_i(s_i, a_i)

where s_i denotes the state of the agent at the current moment, a_i denotes the behavior selected by the agent at the current moment, and a_i' denotes the optimal search behavior selected for the next moment.

According to a_i', the optimal desired position p_r in S2 is obtained, and from p_r the velocity and position of the agent are calculated (the update equations are given as equation images in the original); by iteratively querying the Q-value table, the cluster achieves a global search of the area.
the invention provides a cooperative Q-learning algorithm based on the traditional Q-learning, realizes the sharing of the learning experience of the neighbor, filters out useless experience by a screening mode in the sharing process, greatly reduces the communication traffic between intelligent agents while improving the learning efficiency, and improves the cluster searching efficiency and effect.
The foregoing is a preferred embodiment of the present invention. It should be understood that the invention is not limited to the form disclosed herein and is not to be construed as excluding other embodiments; it may be used in various other combinations, modifications, and environments, and may be changed within the scope of the inventive concept described herein in light of the above teachings or the skill or knowledge of the relevant art. Modifications and variations made by those skilled in the art that do not depart from the spirit and scope of the invention shall fall within the protection scope of the appended claims.

Claims (6)

1. A multi-agent area searching method based on collaborative reinforcement learning is characterized in that: the method comprises the following steps:
s1, establishing a motion model of a cluster system;
s2, defining a fusion mode of a gamma information map and a cluster information map;
s3, defining a state space and a behavior space required by reinforcement learning training;
s4, defining an interactive reinforcement learning training method according to the state space and the behavior space;
and S5, acquiring a Q value table obtained by training, carrying out region search according to the motion model, and determining the position of the next moment according to the Q value table.
2. The multi-agent area search method based on cooperative reinforcement learning as claimed in claim 1, wherein: the step S1 includes the following sub-steps:
Based on a flocking cluster control algorithm, assume that the cluster V contains p agents, V = {1, 2, ..., p}; the ith agent in the cluster is denoted agent i, and its kinematic model is expressed by the following equations:

dp_i/dt = v_i,  dv_i/dt = u_i

where p_i is the position of agent i, v_i is its velocity, and u_i is its acceleration, which is the control input of the cluster agent.

During the search process, the control input of each cluster agent is expressed as:

u_i = u_i^α + u_i^γ

where u_i^α is the collision-avoidance control term of the cluster agent and u_i^γ is the control term that moves the cluster agent toward its desired position.

The collision-avoidance term u_i^α and the potential-field force between agent i and agent j are defined by expressions that appear only as equation images in the original. In those expressions, the coefficients c are positive constants, z is the input quantity, p_i is the position of cluster agent i, d_α = ||d||_σ, r_α is the communication distance between cluster agents, σ_1, a, b and c are user-defined parameters, and h and l are constants. This construction guarantees the smoothness of the potential-field function; to guarantee differentiability, the σ-norm is defined as:

||z||_σ = (1/ε)[√(1 + ε||z||²) − 1]

where ε is a user-defined parameter.

The control term that moves the cluster agent toward the desired position is a PD law (given as an equation image in the original) combining a proportional term on the position error p_i − p_γ and a derivative term on the velocity v_i, where c_1^γ and c_2^γ are the proportional and derivative control parameters of the PID controller, v_i is the velocity of agent i, and p_γ is the expected position of agent i at the next moment.
3. The multi-agent area search method based on cooperative reinforcement learning as claimed in claim 1, wherein: the step S2 includes the following sub-steps:
Assume the traversal area is a rectangular area of size m × n. The area to be searched is quantized into a γ-information map consisting of a k × l matrix, each cell of which corresponds to a γ point, so that a complete search of the area is converted into a complete traversal of the γ points of the information map. The γ points form the γ-information map set of agent i:

m_i(γ) = {γ_{x,y}},  x = 1, 2, ..., k,  y = 1, 2, ..., l;

where k and l are obtained from the area dimensions and the perception radius by a relation given as an equation image in the original, r_s being a user-defined parameter that denotes the perception radius of agent i.

The γ-information maps of all agents in the cluster, {m_1(γ_{x,y}), m_2(γ_{x,y}), ..., m_p(γ_{x,y})}, are obtained; if agent i has traversed a γ point, the information of that point is m_i(γ_{x,y}) = 1, otherwise m_i(γ_{x,y}) = 0. Agent 1, agent 2, ..., agent p establish communication, and each agent fuses its own γ-information map with the γ-information maps of its neighbors according to a fusion formula given as an equation image in the original, where m_i(γ_{x,y}) is the γ-information map of agent i, m_s(γ_{x,y}) is the fused γ-information map of the whole cluster, and V is the set of cluster agents.
4. The multi-agent area search method based on cooperative reinforcement learning as claimed in claim 1, wherein: the step S3 includes:
Acquire the state space of each cluster agent; the state of agent i is defined by an expression given as an equation image in the original, in which M_i(γ) is the γ-information-map coverage of node i and γ^i_{x,y} is the γ-map location of node i at the next moment.

Acquire the behavior space of each agent in the cluster; a behavior is the selection of a γ point. When node i is in a state S_i, the selectable γ points are the current γ-map location and the 8 cells surrounding it. Denoting these 9 locations by 1 to 9, the node behavior space is defined as the following equation:

A_i = {1, 2, 3, 4, 5, 6, 7, 8, 9}.
5. the multi-agent area search method based on cooperative reinforcement learning as claimed in claim 1, wherein: the step S4 includes:
According to the agent's state and behavior, the Q-value table update function of the typical Q-learning algorithm is:

Q^{k+1}(s_i, a_i) = Q^k(s_i, a_i) + α[ r_i + η·max_{a_i'} Q^k(s_i', a_i') − Q^k(s_i, a_i) ]

where k denotes the kth training iteration, α is the learning rate, η is the discount factor, a_i' denotes the next action, and s_i' denotes the next state.

In order to reduce the computational complexity and communication traffic of the learning algorithm and to accelerate its convergence, an agent in the cluster obtains the Q values of the other agents it is connected to, and only the state-action entries with larger Q values in its neighbors' Q-value tables are used as references when updating its own Q values. The Q-value table of the ith agent at the (k+1)th iteration is then updated by a weighted fusion rule whose expressions appear as equation images in the original, where Q_j^k(s_i, a_i) is the Q value of the jth agent, the neighborhood of agent i and the weights w_j are defined by further expressions given as equation images, q_i denotes the position of the ith cluster agent, r_a is a constant denoting the adjacency radius, and h_r(·) is a threshold function, likewise defined by an expression given as an equation image in the original.

r_i is the return function, defined by an expression (given as an equation image in the original) in which γ_{x,y}' is the next γ point obtained by performing action a_i, 0 < c_r < 1 is a constant, k_r is the number of repeated traversals, and T is the time consumed in traversing the γ-information map or covering the dynamic area; r(T) is defined by a further expression (given as an equation image) whose coefficients are constants, r_ref is a constant that serves as the standard return value for the entire traversal process, and T_min is the minimum traversal time under ideal conditions, calculated from a formula (given as an equation image) in which m and n are the dimensions of the search area, k and l are the dimensions of the corresponding information map, and v_max is a constant denoting the maximum speed.

Each agent selects the action with the maximum weighted value according to the current environment and updates its Q-value table. Compared with the traditional reinforcement learning method, this cooperative reinforcement learning method improves efficiency: the interaction of Q-value tables among the agents in the cluster optimizes the training process and shortens the training time.
6. The multi-agent area search method based on cooperative reinforcement learning as claimed in claim 1, wherein: the step S5 includes:
Steps S1-S4 are repeated to iteratively update the Q-value tables of the agents in the cluster until the Q-value tables converge. The cluster area-search algorithm trained by collaborative reinforcement learning generates, from the Q-value table, the behavior of the whole cluster at the next moment while the cluster carries out its search task.

After the collaborative reinforcement learning training, each agent selects the best behavior at the next moment according to the Q-value table so as to maximize the efficiency of the area search, expressed as follows:

p_r = a_i' = arg max Q_i(s_i, a_i)

where s_i denotes the state of the agent at the current moment, a_i denotes the behavior selected by the agent at the current moment, and a_i' denotes the optimal search behavior selected for the next moment.

According to a_i', the optimal desired position p_r in S2 is obtained, and from p_r the velocity and position of the agent are calculated (the update equations are given as equation images in the original); by iteratively querying the Q-value table, the cluster achieves a global search of the area.
CN202010710554.6A 2020-07-22 2020-07-22 Multi-agent area searching method based on collaborative reinforcement learning Pending CN111880564A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010710554.6A CN111880564A (en) 2020-07-22 2020-07-22 Multi-agent area searching method based on collaborative reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010710554.6A CN111880564A (en) 2020-07-22 2020-07-22 Multi-agent area searching method based on collaborative reinforcement learning

Publications (1)

Publication Number Publication Date
CN111880564A true CN111880564A (en) 2020-11-03

Family

ID=73155230

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010710554.6A Pending CN111880564A (en) 2020-07-22 2020-07-22 Multi-agent area searching method based on collaborative reinforcement learning

Country Status (1)

Country Link
CN (1) CN111880564A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113156954A (en) * 2021-04-25 2021-07-23 电子科技大学 Multi-agent cluster obstacle avoidance method based on reinforcement learning
CN113189983A (en) * 2021-04-13 2021-07-30 中国人民解放军国防科技大学 Open scene-oriented multi-robot cooperative multi-target sampling method
CN113515130A (en) * 2021-08-26 2021-10-19 鲁东大学 Method and storage medium for agent path planning
CN113592162A (en) * 2021-07-22 2021-11-02 西北工业大学 Multi-agent reinforcement learning-based multi-underwater unmanned aircraft collaborative search method
CN113645317A (en) * 2021-10-15 2021-11-12 中国科学院自动化研究所 Loose cluster control method, device, equipment, medium and product
CN114326749A (en) * 2022-01-11 2022-04-12 电子科技大学长三角研究院(衢州) Deep Q-Learning-based cluster area coverage method
CN114610024A (en) * 2022-02-25 2022-06-10 电子科技大学 Multi-agent collaborative search energy-saving method used in mountain environment
CN114764251A (en) * 2022-05-13 2022-07-19 电子科技大学 Energy-saving method for multi-agent collaborative search based on energy consumption model
CN114815820A (en) * 2022-04-18 2022-07-29 电子科技大学 Intelligent vehicle linear path planning method based on adaptive filtering

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040017313A1 (en) * 2002-03-12 2004-01-29 Alberto Menache Motion tracking system and method
CN110109358A (en) * 2019-05-17 2019-08-09 电子科技大学 A kind of mixing multiple agent cooperative control method based on feedback
CN110995006A (en) * 2019-11-28 2020-04-10 深圳第三代半导体研究院 Design method of power electronic transformer

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040017313A1 (en) * 2002-03-12 2004-01-29 Alberto Menache Motion tracking system and method
CN110109358A (en) * 2019-05-17 2019-08-09 电子科技大学 A kind of mixing multiple agent cooperative control method based on feedback
CN110995006A (en) * 2019-11-28 2020-04-10 深圳第三代半导体研究院 Design method of power electronic transformer

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
肖剑 (Xiao Jian): "Research on Flocking Cluster Cooperative Control Algorithm Based on Reinforcement Learning", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113189983B (en) * 2021-04-13 2022-05-31 中国人民解放军国防科技大学 Open scene-oriented multi-robot cooperative multi-target sampling method
CN113189983A (en) * 2021-04-13 2021-07-30 中国人民解放军国防科技大学 Open scene-oriented multi-robot cooperative multi-target sampling method
CN113156954A (en) * 2021-04-25 2021-07-23 电子科技大学 Multi-agent cluster obstacle avoidance method based on reinforcement learning
CN113592162A (en) * 2021-07-22 2021-11-02 西北工业大学 Multi-agent reinforcement learning-based multi-underwater unmanned aircraft collaborative search method
CN113592162B (en) * 2021-07-22 2023-06-02 西北工业大学 Multi-agent reinforcement learning-based multi-underwater unmanned vehicle collaborative search method
CN113515130A (en) * 2021-08-26 2021-10-19 鲁东大学 Method and storage medium for agent path planning
CN113515130B (en) * 2021-08-26 2024-02-02 鲁东大学 Method and storage medium for agent path planning
CN113645317A (en) * 2021-10-15 2021-11-12 中国科学院自动化研究所 Loose cluster control method, device, equipment, medium and product
CN113645317B (en) * 2021-10-15 2022-01-18 中国科学院自动化研究所 Loose cluster control method, device, equipment, medium and product
CN114326749A (en) * 2022-01-11 2022-04-12 电子科技大学长三角研究院(衢州) Deep Q-Learning-based cluster area coverage method
CN114326749B (en) * 2022-01-11 2023-10-13 电子科技大学长三角研究院(衢州) Deep Q-Learning-based cluster area coverage method
CN114610024A (en) * 2022-02-25 2022-06-10 电子科技大学 Multi-agent collaborative search energy-saving method used in mountain environment
CN114610024B (en) * 2022-02-25 2023-06-02 电子科技大学 Multi-agent collaborative searching energy-saving method for mountain land
CN114815820A (en) * 2022-04-18 2022-07-29 电子科技大学 Intelligent vehicle linear path planning method based on adaptive filtering
CN114815820B (en) * 2022-04-18 2023-10-03 电子科技大学 Intelligent body trolley linear path planning method based on adaptive filtering
CN114764251A (en) * 2022-05-13 2022-07-19 电子科技大学 Energy-saving method for multi-agent collaborative search based on energy consumption model
CN114764251B (en) * 2022-05-13 2023-10-10 电子科技大学 Multi-agent collaborative search energy-saving method based on energy consumption model

Similar Documents

Publication Publication Date Title
CN111880564A (en) Multi-agent area searching method based on collaborative reinforcement learning
Bayerlein et al. Trajectory optimization for autonomous flying base station via reinforcement learning
CN110502033B (en) Fixed-wing unmanned aerial vehicle cluster control method based on reinforcement learning
CN110347155B (en) Intelligent vehicle automatic driving control method and system
CN110632931A (en) Mobile robot collision avoidance planning method based on deep reinforcement learning in dynamic environment
CN111061277A (en) Unmanned vehicle global path planning method and device
CN111625019B (en) Trajectory planning method for four-rotor unmanned aerial vehicle suspension air transportation system based on reinforcement learning
CN110531786B (en) Unmanned aerial vehicle maneuvering strategy autonomous generation method based on DQN
CN110181508B (en) Three-dimensional route planning method and system for underwater robot
CN110544296A (en) intelligent planning method for three-dimensional global flight path of unmanned aerial vehicle in environment with uncertain enemy threat
CN112230678A (en) Three-dimensional unmanned aerial vehicle path planning method and planning system based on particle swarm optimization
CN110991972A (en) Cargo transportation system based on multi-agent reinforcement learning
Schaal et al. Assessing the quality of learned local models
CN113821041B (en) Multi-robot collaborative navigation and obstacle avoidance method
CN112766499A (en) Method for realizing autonomous flight of unmanned aerial vehicle through reinforcement learning technology
CN113268081A (en) Small unmanned aerial vehicle prevention and control command decision method and system based on reinforcement learning
Bayerlein et al. Learning to rest: A Q-learning approach to flying base station trajectory design with landing spots
CN116451934B (en) Multi-unmanned aerial vehicle edge calculation path optimization and dependent task scheduling optimization method and system
Zhang et al. Danger-aware adaptive composition of drl agents for self-navigation
CN114089776A (en) Unmanned aerial vehicle obstacle avoidance method based on deep reinforcement learning
CN114020024A (en) Unmanned aerial vehicle path planning method based on Monte Carlo tree search
CN114610024B (en) Multi-agent collaborative searching energy-saving method for mountain land
CN116225046A (en) Unmanned aerial vehicle autonomous path planning method based on deep reinforcement learning under unknown environment
Zhu et al. A novel method combining leader-following control and reinforcement learning for pursuit evasion games of multi-agent systems
Zhang et al. Path planning of patrol robot based on modified grey wolf optimizer

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20201103

RJ01 Rejection of invention patent application after publication