CN111880564A - Multi-agent area searching method based on collaborative reinforcement learning - Google Patents

Multi-agent area searching method based on collaborative reinforcement learning

Info

Publication number
CN111880564A
Authority
CN
China
Prior art keywords
agent
cluster
gamma
reinforcement learning
value table
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010710554.6A
Other languages
Chinese (zh)
Inventor
张瑛
肖剑
黄治宇
薛玉玺
吴磊
靳一丹
吴冰航
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN202010710554.6A priority Critical patent/CN111880564A/en
Publication of CN111880564A publication Critical patent/CN111880564A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G05: CONTROLLING; REGULATING
    • G05D: SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00: Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/10: Simultaneous control of position or course in three dimensions
    • G05D1/101: Simultaneous control of position or course in three dimensions specially adapted for aircraft
    • G05D1/104: Simultaneous control of position or course in three dimensions specially adapted for aircraft involving a plurality of aircrafts, e.g. formation flying


Abstract

The invention discloses a multi-agent area searching method based on collaborative reinforcement learning, which comprises the following steps: S1, establishing a motion model of the cluster system; S2, defining a fusion mode for the γ information maps and the cluster information map; S3, defining the state space and behavior space required for reinforcement learning training; S4, defining an interactive reinforcement learning training method based on the state space and behavior space; and S5, acquiring the trained Q-value table, carrying out the area search according to the motion model, and determining the position at the next moment according to the Q-value table. The invention enables each agent to share its neighbors' learning experience; during sharing, useless experience is filtered out by screening, which improves learning efficiency while greatly reducing the communication traffic between agents.

Description

Multi-agent area searching method based on collaborative reinforcement learning
Technical Field
The invention relates to multi-agent area search, in particular to a multi-agent area search method based on collaborative reinforcement learning.
Background
Clustering is a very common phenomenon in nature. With the rise of artificial intelligence in recent years, intelligent control has become a popular research field, and great progress has been made on agents such as unmanned aerial vehicles, unmanned vehicles, and mobile robots. The gradual maturity of single-agent technology is pushing intelligent systems toward clusters, and flocking cluster control algorithms are widely applied to tasks such as unmanned aerial vehicle search, reconnaissance, and strike in the face of increasingly complex combat environments and multi-task requirements.
Q-learning is a typical reinforcement learning algorithm that converts learned experience into a Q-table, from which the best strategy can be selected. During cluster traversal, the γ points of the multi-agent search system are planned through Q-learning; once the Q-learning algorithm has finished learning, the optimal γ-point planning strategy is obtained, so that rapid traversal of the target area is achieved.
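To make the role of the Q-value table concrete, the following is a minimal tabular Q-learning sketch in Python; the grid size, learning rate, discount factor and exploration rate are illustrative assumptions rather than values taken from this patent.

```python
import numpy as np

# Minimal tabular Q-learning sketch; all parameter values are illustrative assumptions.
n_states, n_actions = 100, 9          # e.g. a 10 x 10 grid of gamma points, 9 candidate moves
alpha, eta, epsilon = 0.1, 0.9, 0.1   # learning rate, discount factor, exploration rate
Q = np.zeros((n_states, n_actions))   # the Q-value table

def choose_action(state, rng):
    """Epsilon-greedy action selection from the Q-table."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[state]))

def q_update(s, a, r, s_next):
    """One Q-learning step: move Q(s, a) toward r + eta * max_a' Q(s', a')."""
    Q[s, a] += alpha * (r + eta * np.max(Q[s_next]) - Q[s, a])
```

After training, the greedy policy is simply the argmax over each row of Q, which is the "best strategy" referred to above.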
Because the traditional Q-learning algorithm is an independent learning method, an agent does not draw on the historical experience of its neighbors during learning. As a result, a multi-agent system may repeatedly learn the same behavior in the same state, which greatly reduces the learning efficiency of the system.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a multi-agent area searching method based on collaborative reinforcement learning, in which the learning experience of neighbors is shared and useless experience is filtered out by screening during sharing, so that learning efficiency is improved while the communication traffic among agents is greatly reduced.
The purpose of the invention is realized by the following technical scheme: a multi-agent area searching method based on collaborative reinforcement learning comprises the following steps:
s1, establishing a motion model of a cluster system;
s2, defining a fusion mode of a gamma information map and a cluster information map;
s3, defining a state space and a behavior space required by reinforcement learning training;
s4, defining an interactive reinforcement learning training method according to the state space and the behavior space;
and S5, acquiring a Q value table obtained by training, carrying out region search according to the motion model, and determining the position of the next moment according to the Q value table.
Further, the step S1 includes the following sub-steps:
Based on a flocking cluster control algorithm, assume that the cluster V contains p agents, V = {1, 2, ..., p}; the ith agent in the cluster is denoted agent i, and its kinematic model is expressed by the following equations:

dp_i/dt = v_i,  dv_i/dt = u_i

where p_i is the position of agent i, v_i is its velocity, and u_i is its acceleration, which is the control input of the cluster agent.

During the search process, the control input of each cluster agent is expressed as:

u_i = u_i^α + u_i^γ

where u_i^α is the collision-avoidance control term of the cluster agent and u_i^γ is the control term that moves the cluster agent toward its desired position.

The collision-avoidance term u_i^α and the potential-field force between agent i and agent j are defined by expressions that appear only as equation images in the original. In those expressions, the coefficients c are positive constants, z is the input quantity, p_i is the position of cluster agent i, d_α = ||d||_σ, r_α is the communication distance between cluster agents, σ_1, a, b and c are user-defined parameters, and h and l are constants. This construction guarantees the smoothness of the potential-field function; to guarantee differentiability, the σ-norm is defined as:

||z||_σ = (1/ε)[√(1 + ε||z||²) − 1]

where ε is a user-defined parameter.

The control term that moves the cluster agent toward the desired position is a PD law (given as an equation image in the original) combining a proportional term on the position error p_i − p_γ and a derivative term on the velocity v_i, where c_1^γ and c_2^γ are the proportional and derivative control parameters of the PID controller, v_i is the velocity of agent i, and p_γ is the expected position of agent i at the next moment.
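To illustrate how such a second-order motion model can be stepped forward in simulation, the sketch below combines a simplified repulsive collision-avoidance term with a PD term toward the desired γ point and applies Euler integration; the repulsion law, gains and step size are placeholder assumptions, not the exact potential-field expressions of the patent (which are given only as images).

```python
import numpy as np

dt = 0.1                        # integration step (assumed)
c1_gamma, c2_gamma = 1.0, 0.8   # PD gains toward the desired gamma point (assumed values)

def sigma_norm(z, eps=0.1):
    """Sigma-norm: a differentiable surrogate for the Euclidean norm."""
    return (np.sqrt(1.0 + eps * np.dot(z, z)) - 1.0) / eps

def u_alpha(p_i, neighbor_positions, r_alpha=3.0, k_rep=1.5):
    """Simplified collision avoidance: repel from neighbors closer than r_alpha.
    This stands in for the potential-field force of the flocking algorithm."""
    u = np.zeros_like(p_i)
    for p_j in neighbor_positions:
        d = p_i - p_j
        dist = np.linalg.norm(d)
        if 1e-6 < dist < r_alpha:
            u += k_rep * (1.0 / dist - 1.0 / r_alpha) * d / dist
    return u

def u_gamma(p_i, v_i, p_goal):
    """PD term driving the agent toward its expected gamma point p_goal."""
    return -c1_gamma * (p_i - p_goal) - c2_gamma * v_i

def step(p_i, v_i, p_goal, neighbor_positions):
    """Double-integrator update: u_i = u_i_alpha + u_i_gamma, then Euler integration."""
    u_i = u_alpha(p_i, neighbor_positions) + u_gamma(p_i, v_i, p_goal)
    v_next = v_i + dt * u_i
    p_next = p_i + dt * v_next
    return p_next, v_next
```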
Further, the step S2 includes the following sub-steps:
Assume the traversal area is a rectangular area of size m × n. The area to be searched is quantized into a γ-information map consisting of a k × l matrix, each cell of which corresponds to a γ point, so that a complete search of the area is converted into a complete traversal of the γ points of the information map. The γ points form the γ-information map set of agent i:

m_i(γ) = {γ_{x,y}},  x = 1, 2, ..., k,  y = 1, 2, ..., l;

where k and l are obtained from the area dimensions and the perception radius by a relation given as an equation image in the original, r_s being a user-defined parameter that denotes the perception radius of agent i.

The γ-information maps of all agents in the cluster, {m_1(γ_{x,y}), m_2(γ_{x,y}), ..., m_p(γ_{x,y})}, are obtained; if agent i has traversed a γ point, the information of that point is m_i(γ_{x,y}) = 1, otherwise m_i(γ_{x,y}) = 0. Agent 1, agent 2, ..., agent p establish communication, and each agent fuses its own γ-information map with the γ-information maps of its neighbors according to a fusion formula given as an equation image in the original, where m_i(γ_{x,y}) is the γ-information map of agent i, m_s(γ_{x,y}) is the fused γ-information map of the whole cluster, and V is the set of cluster agents.
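Because the γ-information maps are binary coverage maps, fusing an agent's map with its neighbors' maps amounts to an element-wise OR; the sketch below illustrates this under the assumption that each map is a k × l array of 0/1 flags (the patent's fusion formula itself is given only as an image).

```python
import numpy as np

def fuse_gamma_maps(own_map, neighbor_maps):
    """Fuse an agent's gamma-information map with the maps received from its neighbors.
    A cell is marked covered (1) if any agent within communication range has visited it."""
    fused = own_map.copy()
    for m in neighbor_maps:
        fused = np.maximum(fused, m)   # element-wise OR on 0/1 maps
    return fused

# Example: a 4 x 5 quantized search area, two agents with partial coverage.
m1 = np.zeros((4, 5), dtype=int); m1[0, :2] = 1
m2 = np.zeros((4, 5), dtype=int); m2[1, 3:] = 1
print(fuse_gamma_maps(m1, [m2]))   # cells covered by either agent are 1
```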
Further, the step S3 includes:
Acquire the state space of each cluster agent; the state of agent i is defined by an expression given as an equation image in the original, in which M_i(γ) is the γ-information-map coverage of node i and γ^i_{x,y} is the γ-map location of node i at the next moment.

Acquire the behavior space of each agent in the cluster; a behavior is the selection of a γ point. When node i is in a state S_i, the selectable γ points are the current γ-map location and the 8 cells surrounding it. Denoting these 9 locations by 1 to 9, the node behavior space is defined as the following equation:

A_i = {1, 2, 3, 4, 5, 6, 7, 8, 9}.
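Since the behavior space consists of the current γ cell plus its 8 surrounding cells, each action index can be mapped to a grid offset. The particular assignment of indices 1 to 9 in the sketch below is an assumption for illustration; the patent does not fix the numbering.

```python
# Assumed mapping of action indices 1..9 to (dx, dy) offsets on the gamma grid:
# 1..8 are the surrounding cells, 9 keeps the current cell.
ACTION_OFFSETS = {
    1: (-1, -1), 2: (0, -1), 3: (1, -1),
    4: (-1,  0), 5: (1,  0),
    6: (-1,  1), 7: (0,  1), 8: (1,  1),
    9: (0,  0),
}

def next_gamma_point(x, y, action, k, l):
    """Return the gamma-grid cell selected by an action, clipped to the k x l map."""
    dx, dy = ACTION_OFFSETS[action]
    return min(max(x + dx, 0), k - 1), min(max(y + dy, 0), l - 1)
```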
further, the step S4 includes:
According to the agent's state and behavior, the Q-value table update function of the typical Q-learning algorithm is:

Q^{k+1}(s_i, a_i) = Q^k(s_i, a_i) + α[ r_i + η·max_{a_i'} Q^k(s_i', a_i') − Q^k(s_i, a_i) ]

where k denotes the kth training iteration, α is the learning rate, η is the discount factor, a_i' denotes the next action, and s_i' denotes the next state.

In order to reduce the computational complexity and communication traffic of the learning algorithm and to accelerate its convergence, an agent in the cluster obtains the Q values of the other agents it is connected to, and only the state-action entries with larger Q values in its neighbors' Q-value tables are used as references when updating its own Q values. The Q-value table of the ith agent at the (k+1)th iteration is then updated by a weighted fusion rule whose expressions appear as equation images in the original, where Q_j^k(s_i, a_i) is the Q value of the jth agent, the neighborhood of agent i and the weights w_j are defined by further expressions given as equation images, q_i denotes the position of the ith cluster agent, r_a is a constant denoting the adjacency radius, and h_r(·) is a threshold function, likewise defined by an expression given as an equation image in the original.

r_i is the return function, defined by an expression (given as an equation image in the original) in which γ_{x,y}' is the next γ point obtained by performing action a_i, 0 < c_r < 1 is a constant, k_r is the number of repeated traversals, and T is the time consumed in traversing the γ-information map or covering the dynamic area; r(T) is defined by a further expression (given as an equation image) whose coefficients are constants, r_ref is a constant that serves as the standard return value for the entire traversal process, and T_min is the minimum traversal time under ideal conditions, calculated from a formula (given as an equation image) in which m and n are the dimensions of the search area, k and l are the dimensions of the corresponding information map, and v_max is a constant denoting the maximum speed.

Each agent selects the action with the maximum weighted value according to the current environment and updates its Q-value table. Compared with the traditional reinforcement learning method, this cooperative reinforcement learning method improves efficiency: the interaction of Q-value tables among the agents in the cluster optimizes the training process and shortens the training time.
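A minimal sketch of the cooperative update idea follows: after its own Q-learning step, an agent blends in neighbor Q-values only for entries where a neighbor's estimate exceeds its own, weighting closer neighbors more heavily. The screening rule, weight function and radius are simplified stand-ins for the patent's w_j and h_r(·), which appear only as equation images.

```python
import numpy as np

def local_q_update(Q, s, a, r, s_next, alpha=0.1, eta=0.9):
    """Standard single-agent Q-learning step on the agent's own table."""
    Q[s, a] += alpha * (r + eta * np.max(Q[s_next]) - Q[s, a])

def distance_weight(q_i, q_j, r_a=5.0):
    """Illustrative distance-based weight: closer neighbors get more influence, zero beyond r_a."""
    d = np.linalg.norm(np.asarray(q_i) - np.asarray(q_j))
    return max(0.0, 1.0 - d / r_a)

def cooperative_q_fusion(Q_own, neighbor_Qs, neighbor_weights):
    """Blend in neighbors' Q-values, but only where a neighbor's estimate is larger
    than the agent's own (a simple screening of 'useful' shared experience)."""
    Q_new = Q_own.copy()
    for Q_j, w_j in zip(neighbor_Qs, neighbor_weights):
        better = Q_j > Q_new                 # screen: keep only larger neighbor estimates
        Q_new[better] = (1.0 - w_j) * Q_new[better] + w_j * Q_j[better]
    return Q_new
```

This mirrors the idea, described above, of referencing only the larger Q values in the neighbors' tables; the exact weighting and thresholding of the patent are not reproduced here.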
Further, the step S5 includes:
Steps S1-S4 are repeated to iteratively update the Q-value tables of the agents in the cluster until the Q-value tables converge. The cluster area-search algorithm trained by collaborative reinforcement learning generates, from the Q-value table, the behavior of the whole cluster at the next moment while the cluster carries out its search task. After the collaborative reinforcement learning training, each agent selects the best behavior at the next moment according to the Q-value table so as to maximize the efficiency of the area search, expressed as follows:

p_r = a_i' = arg max Q_i(s_i, a_i)

where s_i denotes the state of the agent at the current moment, a_i denotes the behavior selected by the agent at the current moment, and a_i' denotes the optimal search behavior selected for the next moment.

According to a_i', the optimal desired position p_r in S2 is obtained, and from p_r the velocity and position of the agent are calculated (the update equations are given as equation images in the original); by iteratively querying the Q-value table, the cluster achieves a global search of the area.
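Once training has converged, the on-line search loop reduces to greedily reading the Q-value table and converting the chosen action into the next expected γ point, which becomes the desired position p_γ fed to the S1 motion model. The sketch below illustrates this loop, reusing the same assumed 1-to-9 action numbering as in the earlier sketch.

```python
import numpy as np

# Same assumed 1..9 action numbering as before: 1..8 are surrounding cells, 9 stays in place.
ACTION_OFFSETS = {1: (-1, -1), 2: (0, -1), 3: (1, -1), 4: (-1, 0),
                  5: (1, 0), 6: (-1, 1), 7: (0, 1), 8: (1, 1), 9: (0, 0)}

def select_best_action(Q, state):
    """Greedy policy after training: a_i' = argmax_a Q_i(s_i, a), returned as an index in 1..9."""
    return int(np.argmax(Q[state])) + 1

def search_step(Q, state, cell_xy, grid_shape, cell_size):
    """One on-line search step: pick the best action, map it to the next gamma cell,
    and return that cell's center as the desired position for the S1 controller."""
    a = select_best_action(Q, state)
    dx, dy = ACTION_OFFSETS[a]
    k, l = grid_shape
    x = min(max(cell_xy[0] + dx, 0), k - 1)
    y = min(max(cell_xy[1] + dy, 0), l - 1)
    p_gamma = np.array([(x + 0.5) * cell_size, (y + 0.5) * cell_size])
    return a, (x, y), p_gamma
```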
the invention has the beneficial effects that: the invention provides a cooperative Q-learning algorithm based on the traditional Q-learning, realizes the sharing of the learning experience of the neighbor, filters out useless experience by a screening mode in the sharing process, greatly reduces the communication traffic between intelligent agents while improving the learning efficiency, and improves the cluster searching efficiency and effect.
Drawings
FIG. 1 is a flow chart of a method of the present invention;
Detailed Description
The technical solutions of the present invention are further described in detail below with reference to the accompanying drawings, but the scope of the present invention is not limited to the following.
As shown in fig. 1, a multi-agent area search method based on cooperative reinforcement learning includes the following steps:
s1, establishing a motion model of a cluster system;
s2, defining a fusion mode of a gamma information map and a cluster information map;
s3, defining a state space and a behavior space required by reinforcement learning training;
s4, defining an interactive reinforcement learning training method according to the state space and the behavior space;
and S5, acquiring a Q value table obtained by training, carrying out region search according to the motion model, and determining the position of the next moment according to the Q value table.
Further, the step S1 includes the following sub-steps:
Based on a flocking cluster control algorithm, assume that the cluster V contains p agents, V = {1, 2, ..., p}; the ith agent in the cluster is denoted agent i, and its kinematic model is expressed by the following equations:

dp_i/dt = v_i,  dv_i/dt = u_i

where p_i is the position of agent i, v_i is its velocity, and u_i is its acceleration, which is the control input of the cluster agent.

During the search process, the control input of each cluster agent is expressed as:

u_i = u_i^α + u_i^γ

where u_i^α is the collision-avoidance control term of the cluster agent and u_i^γ is the control term that moves the cluster agent toward its desired position.

The collision-avoidance term u_i^α and the potential-field force between agent i and agent j are defined by expressions that appear only as equation images in the original. In those expressions, the coefficients c are positive constants, z is the input quantity, p_i is the position of cluster agent i, d_α = ||d||_σ, r_α is the communication distance between cluster agents, σ_1, a, b and c are user-defined parameters, and h and l are constants. This construction guarantees the smoothness of the potential-field function; to guarantee differentiability, the σ-norm is defined as:

||z||_σ = (1/ε)[√(1 + ε||z||²) − 1]

where ε is a user-defined parameter.

The control term that moves the cluster agent toward the desired position is a PD law (given as an equation image in the original) combining a proportional term on the position error p_i − p_γ and a derivative term on the velocity v_i, where c_1^γ and c_2^γ are the proportional and derivative control parameters of the PID controller, v_i is the velocity of agent i, and p_γ is the expected position of agent i at the next moment.
Further, the step S2 includes the following sub-steps:
Assume the traversal area is a rectangular area of size m × n. The area to be searched is quantized into a γ-information map consisting of a k × l matrix, each cell of which corresponds to a γ point, so that a complete search of the area is converted into a complete traversal of the γ points of the information map. The γ points form the γ-information map set of agent i:

m_i(γ) = {γ_{x,y}},  x = 1, 2, ..., k,  y = 1, 2, ..., l;

where k and l are obtained from the area dimensions and the perception radius by a relation given as an equation image in the original, r_s being a user-defined parameter that denotes the perception radius of agent i.

The γ-information maps of all agents in the cluster, {m_1(γ_{x,y}), m_2(γ_{x,y}), ..., m_p(γ_{x,y})}, are obtained; if agent i has traversed a γ point, the information of that point is m_i(γ_{x,y}) = 1, otherwise m_i(γ_{x,y}) = 0. Agent 1, agent 2, ..., agent p establish communication, and each agent fuses its own γ-information map with the γ-information maps of its neighbors according to a fusion formula given as an equation image in the original, where m_i(γ_{x,y}) is the γ-information map of agent i, m_s(γ_{x,y}) is the fused γ-information map of the whole cluster, and V is the set of cluster agents.
Further, the step S3 includes:
Acquire the state space of each cluster agent; the state of agent i is defined by an expression given as an equation image in the original, in which M_i(γ) is the γ-information-map coverage of node i and γ^i_{x,y} is the γ-map location of node i at the next moment.

Acquire the behavior space of each agent in the cluster; a behavior is the selection of a γ point. When node i is in a state S_i, the selectable γ points are the current γ-map location and the 8 cells surrounding it. Denoting these 9 locations by 1 to 9, the node behavior space is defined as the following equation:

A_i = {1, 2, 3, 4, 5, 6, 7, 8, 9}.
further, the step S4 includes:
According to the agent's state and behavior, the Q-value table update function of the typical Q-learning algorithm is:

Q^{k+1}(s_i, a_i) = Q^k(s_i, a_i) + α[ r_i + η·max_{a_i'} Q^k(s_i', a_i') − Q^k(s_i, a_i) ]

where k denotes the kth training iteration, α is the learning rate, η is the discount factor, a_i' denotes the next action, and s_i' denotes the next state.

In order to reduce the computational complexity and communication traffic of the learning algorithm and to accelerate its convergence, an agent in the cluster obtains the Q values of the other agents it is connected to, and only the state-action entries with larger Q values in its neighbors' Q-value tables are used as references when updating its own Q values. The Q-value table of the ith agent at the (k+1)th iteration is then updated by a weighted fusion rule whose expressions appear as equation images in the original, where Q_j^k(s_i, a_i) is the Q value of the jth agent, the neighborhood of agent i and the weights w_j are defined by further expressions given as equation images, q_i denotes the position of the ith cluster agent, r_a is a constant denoting the adjacency radius, and h_r(·) is a threshold function, likewise defined by an expression given as an equation image in the original.

r_i is the return function, defined by an expression (given as an equation image in the original) in which γ_{x,y}' is the next γ point obtained by performing action a_i, 0 < c_r < 1 is a constant, k_r is the number of repeated traversals, and T is the time consumed in traversing the γ-information map or covering the dynamic area; r(T) is defined by a further expression (given as an equation image) whose coefficients are constants, r_ref is a constant that serves as the standard return value for the entire traversal process, and T_min is the minimum traversal time under ideal conditions, calculated from a formula (given as an equation image) in which m and n are the dimensions of the search area, k and l are the dimensions of the corresponding information map, and v_max is a constant denoting the maximum speed.

Each agent selects the action with the maximum weighted value according to the current environment and updates its Q-value table. Compared with the traditional reinforcement learning method, this cooperative reinforcement learning method improves efficiency: the interaction of Q-value tables among the agents in the cluster optimizes the training process and shortens the training time.
Further, the step S5 includes:
Steps S1-S4 are repeated to iteratively update the Q-value tables of the agents in the cluster until the Q-value tables converge. The cluster area-search algorithm trained by collaborative reinforcement learning generates, from the Q-value table, the behavior of the whole cluster at the next moment while the cluster carries out its search task. After the collaborative reinforcement learning training, each agent selects the best behavior at the next moment according to the Q-value table so as to maximize the efficiency of the area search, expressed as follows:

p_r = a_i' = arg max Q_i(s_i, a_i)

where s_i denotes the state of the agent at the current moment, a_i denotes the behavior selected by the agent at the current moment, and a_i' denotes the optimal search behavior selected for the next moment.

According to a_i', the optimal desired position p_r in S2 is obtained, and from p_r the velocity and position of the agent are calculated (the update equations are given as equation images in the original); by iteratively querying the Q-value table, the cluster achieves a global search of the area.
the invention provides a cooperative Q-learning algorithm based on the traditional Q-learning, realizes the sharing of the learning experience of the neighbor, filters out useless experience by a screening mode in the sharing process, greatly reduces the communication traffic between intelligent agents while improving the learning efficiency, and improves the cluster searching efficiency and effect.
The foregoing is a preferred embodiment of the present invention. It should be understood that the invention is not limited to the form disclosed herein and is not to be construed as excluding other embodiments; it may be used in various other combinations, modifications, and environments, and may be changed within the scope of the inventive concept described herein in light of the above teachings or the skill or knowledge of the relevant art. Modifications and variations made by those skilled in the art that do not depart from the spirit and scope of the invention shall fall within the protection scope of the appended claims.

Claims (6)

1. A multi-agent area searching method based on collaborative reinforcement learning is characterized in that: the method comprises the following steps:
s1, establishing a motion model of a cluster system;
s2, defining a fusion mode of a gamma information map and a cluster information map;
s3, defining a state space and a behavior space required by reinforcement learning training;
s4, defining an interactive reinforcement learning training method according to the state space and the behavior space;
and S5, acquiring a Q value table obtained by training, carrying out region search according to the motion model, and determining the position of the next moment according to the Q value table.
2. The multi-agent area search method based on cooperative reinforcement learning as claimed in claim 1, wherein: the step S1 includes the following sub-steps:
Based on a flocking cluster control algorithm, assume that the cluster V contains p agents, V = {1, 2, ..., p}; the ith agent in the cluster is denoted agent i, and its kinematic model is expressed by the following equations:

dp_i/dt = v_i,  dv_i/dt = u_i

where p_i is the position of agent i, v_i is its velocity, and u_i is its acceleration, which is the control input of the cluster agent.

During the search process, the control input of each cluster agent is expressed as:

u_i = u_i^α + u_i^γ

where u_i^α is the collision-avoidance control term of the cluster agent and u_i^γ is the control term that moves the cluster agent toward its desired position.

The collision-avoidance term u_i^α and the potential-field force between agent i and agent j are defined by expressions that appear only as equation images in the original. In those expressions, the coefficients c are positive constants, z is the input quantity, p_i is the position of cluster agent i, d_α = ||d||_σ, r_α is the communication distance between cluster agents, σ_1, a, b and c are user-defined parameters, and h and l are constants. This construction guarantees the smoothness of the potential-field function; to guarantee differentiability, the σ-norm is defined as:

||z||_σ = (1/ε)[√(1 + ε||z||²) − 1]

where ε is a user-defined parameter.

The control term that moves the cluster agent toward the desired position is a PD law (given as an equation image in the original) combining a proportional term on the position error p_i − p_γ and a derivative term on the velocity v_i, where c_1^γ and c_2^γ are the proportional and derivative control parameters of the PID controller, v_i is the velocity of agent i, and p_γ is the expected position of agent i at the next moment.
3. The multi-agent area search method based on cooperative reinforcement learning as claimed in claim 1, wherein: the step S2 includes the following sub-steps:
Assume the traversal area is a rectangular area of size m × n. The area to be searched is quantized into a γ-information map consisting of a k × l matrix, each cell of which corresponds to a γ point, so that a complete search of the area is converted into a complete traversal of the γ points of the information map. The γ points form the γ-information map set of agent i:

m_i(γ) = {γ_{x,y}},  x = 1, 2, ..., k,  y = 1, 2, ..., l;

where k and l are obtained from the area dimensions and the perception radius by a relation given as an equation image in the original, r_s being a user-defined parameter that denotes the perception radius of agent i.

The γ-information maps of all agents in the cluster, {m_1(γ_{x,y}), m_2(γ_{x,y}), ..., m_p(γ_{x,y})}, are obtained; if agent i has traversed a γ point, the information of that point is m_i(γ_{x,y}) = 1, otherwise m_i(γ_{x,y}) = 0. Agent 1, agent 2, ..., agent p establish communication, and each agent fuses its own γ-information map with the γ-information maps of its neighbors according to a fusion formula given as an equation image in the original, where m_i(γ_{x,y}) is the γ-information map of agent i, m_s(γ_{x,y}) is the fused γ-information map of the whole cluster, and V is the set of cluster agents.
4. The multi-agent area search method based on cooperative reinforcement learning as claimed in claim 1, wherein: the step S3 includes:
Acquire the state space of each cluster agent; the state of agent i is defined by an expression given as an equation image in the original, in which M_i(γ) is the γ-information-map coverage of node i and γ^i_{x,y} is the γ-map location of node i at the next moment.

Acquire the behavior space of each agent in the cluster; a behavior is the selection of a γ point. When node i is in a state S_i, the selectable γ points are the current γ-map location and the 8 cells surrounding it. Denoting these 9 locations by 1 to 9, the node behavior space is defined as the following equation:

A_i = {1, 2, 3, 4, 5, 6, 7, 8, 9}.
5. the multi-agent area search method based on cooperative reinforcement learning as claimed in claim 1, wherein: the step S4 includes:
According to the agent's state and behavior, the Q-value table update function of the typical Q-learning algorithm is:

Q^{k+1}(s_i, a_i) = Q^k(s_i, a_i) + α[ r_i + η·max_{a_i'} Q^k(s_i', a_i') − Q^k(s_i, a_i) ]

where k denotes the kth training iteration, α is the learning rate, η is the discount factor, a_i' denotes the next action, and s_i' denotes the next state.

In order to reduce the computational complexity and communication traffic of the learning algorithm and to accelerate its convergence, an agent in the cluster obtains the Q values of the other agents it is connected to, and only the state-action entries with larger Q values in its neighbors' Q-value tables are used as references when updating its own Q values. The Q-value table of the ith agent at the (k+1)th iteration is then updated by a weighted fusion rule whose expressions appear as equation images in the original, where Q_j^k(s_i, a_i) is the Q value of the jth agent, the neighborhood of agent i and the weights w_j are defined by further expressions given as equation images, q_i denotes the position of the ith cluster agent, r_a is a constant denoting the adjacency radius, and h_r(·) is a threshold function, likewise defined by an expression given as an equation image in the original.

r_i is the return function, defined by an expression (given as an equation image in the original) in which γ_{x,y}' is the next γ point obtained by performing action a_i, 0 < c_r < 1 is a constant, k_r is the number of repeated traversals, and T is the time consumed in traversing the γ-information map or covering the dynamic area; r(T) is defined by a further expression (given as an equation image) whose coefficients are constants, r_ref is a constant that serves as the standard return value for the entire traversal process, and T_min is the minimum traversal time under ideal conditions, calculated from a formula (given as an equation image) in which m and n are the dimensions of the search area, k and l are the dimensions of the corresponding information map, and v_max is a constant denoting the maximum speed.

Each agent selects the action with the maximum weighted value according to the current environment and updates its Q-value table. Compared with the traditional reinforcement learning method, this cooperative reinforcement learning method improves efficiency: the interaction of Q-value tables among the agents in the cluster optimizes the training process and shortens the training time.
6. The multi-agent area search method based on cooperative reinforcement learning as claimed in claim 1, wherein: the step S5 includes:
Steps S1-S4 are repeated to iteratively update the Q-value tables of the agents in the cluster until the Q-value tables converge. The cluster area-search algorithm trained by collaborative reinforcement learning generates, from the Q-value table, the behavior of the whole cluster at the next moment while the cluster carries out its search task.

After the collaborative reinforcement learning training, each agent selects the best behavior at the next moment according to the Q-value table so as to maximize the efficiency of the area search, expressed as follows:

p_r = a_i' = arg max Q_i(s_i, a_i)

where s_i denotes the state of the agent at the current moment, a_i denotes the behavior selected by the agent at the current moment, and a_i' denotes the optimal search behavior selected for the next moment.

According to a_i', the optimal desired position p_r in S2 is obtained, and from p_r the velocity and position of the agent are calculated (the update equations are given as equation images in the original); by iteratively querying the Q-value table, the cluster achieves a global search of the area.
CN202010710554.6A 2020-07-22 2020-07-22 Multi-agent area searching method based on collaborative reinforcement learning Pending CN111880564A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010710554.6A CN111880564A (en) 2020-07-22 2020-07-22 Multi-agent area searching method based on collaborative reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010710554.6A CN111880564A (en) 2020-07-22 2020-07-22 Multi-agent area searching method based on collaborative reinforcement learning

Publications (1)

Publication Number Publication Date
CN111880564A true CN111880564A (en) 2020-11-03

Family

ID=73155230

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010710554.6A Pending CN111880564A (en) 2020-07-22 2020-07-22 Multi-agent area searching method based on collaborative reinforcement learning

Country Status (1)

Country Link
CN (1) CN111880564A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113156954A (en) * 2021-04-25 2021-07-23 电子科技大学 Multi-agent cluster obstacle avoidance method based on reinforcement learning
CN113189983A (en) * 2021-04-13 2021-07-30 中国人民解放军国防科技大学 Open scene-oriented multi-robot cooperative multi-target sampling method
CN113515130A (en) * 2021-08-26 2021-10-19 鲁东大学 Method and storage medium for agent path planning
CN113592162A (en) * 2021-07-22 2021-11-02 西北工业大学 Multi-agent reinforcement learning-based multi-underwater unmanned aircraft collaborative search method
CN113645317A (en) * 2021-10-15 2021-11-12 中国科学院自动化研究所 Loose cluster control method, device, equipment, medium and product
CN114326749A (en) * 2022-01-11 2022-04-12 电子科技大学长三角研究院(衢州) Deep Q-Learning-based cluster area coverage method
CN114610024A (en) * 2022-02-25 2022-06-10 电子科技大学 Multi-agent collaborative search energy-saving method used in mountain environment
CN114764251A (en) * 2022-05-13 2022-07-19 电子科技大学 Energy-saving method for multi-agent collaborative search based on energy consumption model
CN114815820A (en) * 2022-04-18 2022-07-29 电子科技大学 Intelligent vehicle linear path planning method based on adaptive filtering

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040017313A1 (en) * 2002-03-12 2004-01-29 Alberto Menache Motion tracking system and method
CN110109358A (en) * 2019-05-17 2019-08-09 电子科技大学 A kind of mixing multiple agent cooperative control method based on feedback
CN110995006A (en) * 2019-11-28 2020-04-10 深圳第三代半导体研究院 Design method of power electronic transformer

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040017313A1 (en) * 2002-03-12 2004-01-29 Alberto Menache Motion tracking system and method
CN110109358A (en) * 2019-05-17 2019-08-09 电子科技大学 A kind of mixing multiple agent cooperative control method based on feedback
CN110995006A (en) * 2019-11-28 2020-04-10 深圳第三代半导体研究院 Design method of power electronic transformer

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
肖剑 (Xiao Jian): "Research on Flocking Cluster Cooperative Control Algorithm Based on Reinforcement Learning", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113189983B (en) * 2021-04-13 2022-05-31 中国人民解放军国防科技大学 Open scene-oriented multi-robot cooperative multi-target sampling method
CN113189983A (en) * 2021-04-13 2021-07-30 中国人民解放军国防科技大学 Open scene-oriented multi-robot cooperative multi-target sampling method
CN113156954A (en) * 2021-04-25 2021-07-23 电子科技大学 Multi-agent cluster obstacle avoidance method based on reinforcement learning
CN113592162A (en) * 2021-07-22 2021-11-02 西北工业大学 Multi-agent reinforcement learning-based multi-underwater unmanned aircraft collaborative search method
CN113592162B (en) * 2021-07-22 2023-06-02 西北工业大学 Multi-agent reinforcement learning-based multi-underwater unmanned vehicle collaborative search method
CN113515130A (en) * 2021-08-26 2021-10-19 鲁东大学 Method and storage medium for agent path planning
CN113515130B (en) * 2021-08-26 2024-02-02 鲁东大学 Method and storage medium for agent path planning
CN113645317A (en) * 2021-10-15 2021-11-12 中国科学院自动化研究所 Loose cluster control method, device, equipment, medium and product
CN113645317B (en) * 2021-10-15 2022-01-18 中国科学院自动化研究所 Loose cluster control method, device, equipment, medium and product
CN114326749A (en) * 2022-01-11 2022-04-12 电子科技大学长三角研究院(衢州) Deep Q-Learning-based cluster area coverage method
CN114326749B (en) * 2022-01-11 2023-10-13 电子科技大学长三角研究院(衢州) Deep Q-Learning-based cluster area coverage method
CN114610024A (en) * 2022-02-25 2022-06-10 电子科技大学 Multi-agent collaborative search energy-saving method used in mountain environment
CN114610024B (en) * 2022-02-25 2023-06-02 电子科技大学 Multi-agent collaborative searching energy-saving method for mountain land
CN114815820A (en) * 2022-04-18 2022-07-29 电子科技大学 Intelligent vehicle linear path planning method based on adaptive filtering
CN114815820B (en) * 2022-04-18 2023-10-03 电子科技大学 Intelligent body trolley linear path planning method based on adaptive filtering
CN114764251A (en) * 2022-05-13 2022-07-19 电子科技大学 Energy-saving method for multi-agent collaborative search based on energy consumption model
CN114764251B (en) * 2022-05-13 2023-10-10 电子科技大学 Multi-agent collaborative search energy-saving method based on energy consumption model

Similar Documents

Publication Publication Date Title
CN111880564A (en) Multi-agent area searching method based on collaborative reinforcement learning
Bayerlein et al. Trajectory optimization for autonomous flying base station via reinforcement learning
CN110502033B (en) Fixed-wing unmanned aerial vehicle cluster control method based on reinforcement learning
CN110347155B (en) Intelligent vehicle automatic driving control method and system
CN110632931A (en) Mobile robot collision avoidance planning method based on deep reinforcement learning in dynamic environment
CN111061277A (en) Unmanned vehicle global path planning method and device
CN111625019B (en) Trajectory planning method for four-rotor unmanned aerial vehicle suspension air transportation system based on reinforcement learning
CN110531786B (en) Unmanned aerial vehicle maneuvering strategy autonomous generation method based on DQN
CN110181508B (en) Three-dimensional route planning method and system for underwater robot
CN110544296A (en) intelligent planning method for three-dimensional global flight path of unmanned aerial vehicle in environment with uncertain enemy threat
CN112230678A (en) Three-dimensional unmanned aerial vehicle path planning method and planning system based on particle swarm optimization
CN110991972A (en) Cargo transportation system based on multi-agent reinforcement learning
Schaal et al. Assessing the quality of learned local models
CN113821041B (en) Multi-robot collaborative navigation and obstacle avoidance method
CN112766499A (en) Method for realizing autonomous flight of unmanned aerial vehicle through reinforcement learning technology
CN113268081A (en) Small unmanned aerial vehicle prevention and control command decision method and system based on reinforcement learning
Bayerlein et al. Learning to rest: A Q-learning approach to flying base station trajectory design with landing spots
CN116451934B (en) Multi-unmanned aerial vehicle edge calculation path optimization and dependent task scheduling optimization method and system
Zhang et al. Danger-aware adaptive composition of drl agents for self-navigation
CN114089776A (en) Unmanned aerial vehicle obstacle avoidance method based on deep reinforcement learning
CN114020024A (en) Unmanned aerial vehicle path planning method based on Monte Carlo tree search
CN114610024B (en) Multi-agent collaborative searching energy-saving method for mountain land
CN116225046A (en) Unmanned aerial vehicle autonomous path planning method based on deep reinforcement learning under unknown environment
Zhu et al. A novel method combining leader-following control and reinforcement learning for pursuit evasion games of multi-agent systems
Zhang et al. Path planning of patrol robot based on modified grey wolf optimizer

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20201103

RJ01 Rejection of invention patent application after publication