CN114281103B - Aircraft cluster collaborative search method with zero interaction communication - Google Patents

Aircraft cluster collaborative search method with zero interaction communication

Info

Publication number
CN114281103B
CN114281103B (application CN202111532038.XA)
Authority
CN
China
Prior art keywords
network
agent
search
intrinsic
rewards
Prior art date
Legal status
Active
Application number
CN202111532038.XA
Other languages
Chinese (zh)
Other versions
CN114281103A (en)
Inventor
惠俊鹏
范佳宣
张旭辉
路鹰
陈海鹏
李博遥
黄虎
王振亚
李君�
郑本昌
阎岩
李丝然
何昳頔
张佳
任金磊
吴志壕
刘峰
范中行
王鹏
吴海华
程炳琳
周辉
韩特
王颖昕
刘洋
孟元军
Current Assignee
China Academy of Launch Vehicle Technology CALT
Original Assignee
China Academy of Launch Vehicle Technology CALT
Priority date
Filing date
Publication date
Application filed by China Academy of Launch Vehicle Technology CALT filed Critical China Academy of Launch Vehicle Technology CALT
Priority to CN202111532038.XA priority Critical patent/CN114281103B/en
Publication of CN114281103A publication Critical patent/CN114281103A/en
Application granted granted Critical
Publication of CN114281103B publication Critical patent/CN114281103B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Abstract

The aircraft cluster collaborative search method with zero interactive communication uses information on the matrix-type distribution of targets to compile statistics on target position data and form prior information; a search experience pool and a search strategy are initialized, and a search task interaction environment is established; an aircraft cluster collaborative search framework based on multi-agent reinforcement learning is constructed, in which each agent obtains observation information from the environment; an intrinsic reward Q network and an extrinsic reward Q network are built for each agent, a mixed Q network is built for the aircraft cluster, and learning and training are performed; during execution, each agent selects actions according to its local observations, satisfying the practical constraints of most real environments. The invention solves problems in the prior art such as sparse rewards making it difficult for agents to obtain reward signals in large search spaces, slow learning, and dependence of the search process on communication and global information.

Description

Aircraft cluster collaborative search method with zero interaction communication
Technical Field
The invention relates to the field of collaborative decision-making of aircraft clusters, in particular to an aircraft cluster collaborative search method with zero interactive communication.
Background
The United States, exploiting its advantage in electronic countermeasure capability, can use integrated land, sea and air electronic countermeasure means to construct a combat environment that denies the adversary's communications in target areas such as aircraft carriers and bases. Traditional cooperative approaches lose their coordination capability once communication is suppressed, and the combat capability of the units involved is greatly reduced, so there is an urgent need to develop new technical means for effective cooperation in communication-denied environments.
Some progress has been made on collaborative unmanned clusters that operate without communication. DARPA's Collaborative Operations in Denied Environments (CODE) project reduces the dependence on communication through comprehensive means such as algorithms, software and system architecture, enhancing the combat capability of unmanned aerial vehicles or missiles in denied environments.
Most current cluster collaborative search algorithms adopt methods based on communication negotiation or partition traversal. For the dynamic, adaptive collaborative decision problem posed by time-sensitive targets, they suffer from low search efficiency, poor search effectiveness and high dependence on communication.
Disclosure of Invention
The technical problem solved by the invention is as follows: an aircraft cluster collaborative search method with zero-interaction communication is provided, which overcomes the problems of low search efficiency, poor search effectiveness and high dependence on communication in the prior art.
The technical scheme of the invention is as follows:
an aircraft cluster collaborative search method with zero interaction communication, comprising the following steps:
firstly, data on the target position distribution are counted by utilizing information on the matrix-type distribution of targets to form prior information, positions with a high probability of target occurrence are searched preferentially, and the search space is reduced;
secondly, initializing a search experience pool and a search strategy, and establishing a search task interaction environment;
thirdly, constructing an aircraft cluster collaborative search framework based on multi-agent reinforcement learning, and performing collaborative search on a target by utilizing the framework to obtain a search position of each agent at the next moment, and an intrinsic reward and an extrinsic reward at the current moment; each aircraft in the cluster of aircraft is designated as an agent;
fourthly, an intrinsic reward Q network and an extrinsic reward Q network are constructed for each agent, and a mixed Q network is constructed for the aircraft cluster; the intrinsic reward Q network of each agent receives the global state information at the current moment, the search position at the next moment and the credit assigned by the mixed Q network, and through learning and training its output value approaches the intrinsic reward at the current moment; the extrinsic reward Q network of each agent receives the global state information at the current moment, the search position at the next moment and the credit assigned by the mixed Q network, and through learning and training its output value approaches the extrinsic reward at the current moment; the mixed Q network performs credit assignment for the intrinsic reward Q network and the extrinsic reward Q network of each agent, so that the search position of each agent at the next moment has a globally cooperative effect; value functions are obtained using the intrinsic reward Q network and the extrinsic reward Q network and are used to update the aircraft cluster collaborative search framework;
and fifthly, executing target collaborative search by using the aircraft cluster collaborative search framework based on multi-agent reinforcement learning, wherein each agent performs action selection according to its own observation information when executing the target collaborative search.
In the third step, the multi-agent reinforcement learning-based aircraft cluster collaborative search framework comprises a state characterization module, a segment memory pool, a random distillation network and an action network which are configured on each agent.
The aircraft cluster collaborative search framework construction process based on multi-agent reinforcement learning comprises the following steps:
(3.1) in each interaction, each agent obtains observation information from the search task interaction environment and inputs it into the state characterization module; the state characterization module extracts from the agent's observation information the state information that can influence the agent's decision and outputs it to the segment memory pool, the random distillation network and the action network, while removing state information and noise that cannot influence the agent's decision;
(3.2) the action network determines the search position of the next moment according to the state information of the current moment;
(3.3) an intrinsic novelty reward r_t^episodic is calculated through the segment memory pool, and a global novelty intrinsic reward multiplier is calculated through the random distillation network, finally obtaining the intrinsic reward at the current moment;
(3.4) the search task interaction environment gives an agent an extrinsic reward when that agent finds a target.
In the step (3.3), the intrinsic novelty reward r_t^episodic is calculated through the segment memory pool as follows:
at the current moment, the state information extracted by the state characterization module from the agent's observation information is denoted f(x_t);
the n memories whose state representations are closest to f(x_t) in Euclidean distance are selected from the segment memory pool by a k-nearest-neighbor algorithm and denoted {f_1, ..., f_n};
a novelty reward inherent in the single search process is then calculated, where K is a kernel function that evaluates the distance between two state representations, ε represents a constant, d is the Euclidean distance metric function, and d̄ represents the average of the distances to the n nearest neighbors.
In the step (3.3), the global novelty intrinsic reward multiplier α_t is calculated through the random distillation network, where err(x_t) denotes the prediction error for the observation information x_t, and μ_e and σ_e are the running mean and running standard deviation of err(x_t).
The intrinsic reward of the h-th agent at the current moment is r_t^h = r_t^episodic · min{max{α_t, 1}, L}, where L is a predetermined upper bound on the reward multiplier α_t.
For each agent, the value functions of the intrinsic reward Q network and the extrinsic reward Q network are calculated as follows, with both networks learning by means of a universal value function approximator:
where Q^e and Q^i denote the extrinsic reward Q network and the intrinsic reward Q network respectively, β_j is the weight of the intrinsic reward Q value, β_j ∈ R+, j ∈ {0, 1, ..., N−1}; N is the number of agents, γ_j is the discount factor in the Q network learning process, θ is the network parameter of the value function, θ^e is the network parameter of the extrinsic reward Q network, and θ^i is the network parameter of the intrinsic reward Q network.
The update strategy of the hybrid Q network is as follows, where Q_h denotes the individual value function of the h-th agent and Q_tot denotes the system value function of the hybrid network.
Compared with the prior art, the invention has the following beneficial effects:
1. the invention adopts a collaborative search algorithm with centralized training and distributed deployment, which can achieve the performance of communication-based collaborative search algorithms without requiring communication, reducing the dependence on communication;
2. the invention combines intrinsic exploration rewards with extrinsic rewards, introducing dense, effective rewards into the sparse-reward collaborative search task, which accelerates learning and improves search efficiency and search effectiveness.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a schematic diagram of a QMIX-based algorithm architecture of the present invention;
FIG. 3 is a probabilistic approximation network of the present invention based on a controllable state representation network;
FIG. 4 is a diagram of the intrinsic reward design of the present invention;
fig. 5 is a schematic diagram of a Q network architecture based on parameterized decoupling according to the present invention.
Detailed Description
The invention provides an aircraft cluster collaborative search method with zero-interaction communication, which solves problems in the prior art such as sparse rewards making it difficult for agents to obtain reward signals in large search spaces, slow learning, and dependence of the search process on communication and global information.
As shown in fig. 1, the steps include:
In the first step, statistics on the target position distribution are compiled by utilizing information on the matrix-type distribution of targets to form prior information, positions with a high probability of target occurrence are searched preferentially, and the search space is reduced.
The construction and utilization process of the prior information is as follows:
(1.1) counting the positions of all targets in the offline data;
(1.2) obtaining a heat map of the target distribution range from the offline data of the target position distribution;
(1.3) initializing the initial positions of the agents according to the frequency of target occurrence in each cell, and giving a small reward of 0.1 when an agent visits an area where a target may occur.
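As an illustration of steps (1.1) to (1.3), the following Python sketch builds a target-occurrence heat map from offline target positions and uses it to initialize agent positions and to compute the small 0.1 shaping reward. The grid representation, the data format and the helper names (build_prior_heatmap, sample_initial_positions, prior_bonus) are illustrative assumptions, not the patent's implementation.

import numpy as np

def build_prior_heatmap(target_positions, grid_shape):
    """Count historical target positions per grid cell and normalize to a probability map."""
    heat = np.zeros(grid_shape, dtype=np.float64)
    for row, col in target_positions:          # offline data: (row, col) grid indices
        heat[row, col] += 1.0
    total = heat.sum()
    return heat / total if total > 0 else np.full(grid_shape, 1.0 / heat.size)

def sample_initial_positions(heatmap, num_agents, rng):
    """Place agents preferentially in cells where targets appeared frequently."""
    flat = heatmap.ravel()
    idx = rng.choice(flat.size, size=num_agents, p=flat)
    return [np.unravel_index(i, heatmap.shape) for i in idx]

def prior_bonus(heatmap, cell, threshold=0.0):
    """Small shaping reward (0.1) for visiting a cell where targets may occur."""
    return 0.1 if heatmap[cell] > threshold else 0.0

# usage example with synthetic offline data
rng = np.random.default_rng(0)
offline_targets = [(2, 3), (2, 4), (7, 7), (2, 3)]
heat = build_prior_heatmap(offline_targets, grid_shape=(10, 10))
print(sample_initial_positions(heat, num_agents=3, rng=rng))
print(prior_bonus(heat, (2, 3)))   # -> 0.1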
In the second step, the search experience pool and the search strategy are initialized, and the search task interaction environment is established.
In the third step, an aircraft cluster collaborative search framework based on multi-agent reinforcement learning is constructed, in which the agents acquire observation information from the environment, the multi-agent system is encouraged to explore the unknown environment, and search actions are sent to the search environment.
The aircraft cluster collaborative search framework based on multi-agent reinforcement learning comprises a state characterization module, a segment memory pool, a random distillation network and an action network which are configured on each agent.
The construction process of the aircraft cluster collaborative search framework based on multi-agent reinforcement learning is as follows:
(3.1) in each interaction, each agent obtains observation information from the search task interaction environment and inputs it into the state characterization module; the state characterization module extracts from the agent's observation information the state information that can influence the agent's decision and outputs it to the segment memory pool, the random distillation network and the action network, while removing state information and noise that cannot influence the agent's decision;
(3.2) the action network determines the search position of the next moment according to the state information of the current moment;
(3.3) an intrinsic novelty reward r_t^episodic is calculated through the segment memory pool, and a global novelty intrinsic reward multiplier is calculated through the random distillation network, finally obtaining the intrinsic reward at the current moment;
(3.4) the search task interaction environment gives an agent an extrinsic reward when that agent finds a target.
In the fourth step, two Q networks are constructed for each agent to learn from different reward signals. During training, the global information available in the training process is fully utilized, and the observations and strategies of the other agents are used as additional state inputs, so that the value function estimate of the current agent is considered explicitly.
The detailed method of constructing two Q networks that learn from their respective reward signals is as follows: the controllable hidden variables output by the controllable-search state characterization module are combined with the environmental (extrinsic) rewards and with the intrinsic rewards calculated by the intrinsic exploration reward module, respectively, and are input into two Q networks based on the universal value function approximator structure; the two Q networks are trained separately to prevent mutual interference between the environmental rewards and the intrinsic rewards.
A mixed Q network is constructed for the aircraft cluster; the intrinsic reward Q network of each agent receives the global state information at the current moment, the search position at the next moment and the credit assigned by the mixed Q network, and through learning and training its output value approaches the intrinsic reward at the current moment; the extrinsic reward Q network of each agent receives the global state information at the current moment, the search position at the next moment and the credit assigned by the mixed Q network, and through learning and training its output value approaches the extrinsic reward at the current moment; the mixed Q network performs credit assignment for the intrinsic reward Q network and the extrinsic reward Q network of each agent, so that the search position of each agent at the next moment has a globally cooperative effect; value functions are obtained using the intrinsic reward Q network and the extrinsic reward Q network and are used to update the aircraft cluster collaborative search framework.
In the fifth step, target collaborative search is executed using the aircraft cluster collaborative search framework based on multi-agent reinforcement learning; during execution, each agent selects actions according to its local observations, satisfying the practical constraints of most real environments.
The training and deployment method of the aircraft cluster collaborative search model with zero-interaction communication is as follows: with the objective of searching for the maximum number of targets in the shortest possible time, learning-based decisions are made over the search area.
As shown in Fig. 2, in order to solve the problem that a single-agent algorithm repeatedly makes decisions in the same state, a multi-agent algorithm is introduced to perform reward distribution among multiple agents. In a centralized-training, independent-learning scheme based on parameter sharing, each agent learns with a multi-agent value mixing algorithm. The detailed method is as follows:
in the training process, global information which can be obtained in the training process is fully utilized, and the observation and strategy of other intelligent agents are used as additional state input, so that the current intelligent agent value function estimation is explicitly considered. Under the condition that the observation and the strategy of other agents are unchanged, each agent is expected to adjust the strategy of the agent and make the best response in the current state so as to maximize the overall benefit of the system. When the method is executed, the intelligent agent performs action selection according to local observation, and the real constraint of a zero communication environment is met. In order to better facilitate accurate assessment, action information performed by other agents is acquired accordingly during the environmental risk assessment and agent training process. On the basis, researches are conducted aiming at the reliability distribution problem facing contribution analysis, and a framework of a multi-agent value mixing algorithm is adopted to distribute a value function of the system.
The mixed Q network adopts a Q network based on parameterized decoupling. It receives the action values output by the Q network of each agent and extracts information from the global state; after processing by the controllable-search state characterization module, this information is combined with the extrinsic rewards and with the intrinsic exploration rewards output by the intrinsic exploration reward module, respectively, to update the action value of the whole system, and signals are then returned to the Q network of each agent to update that agent's strategy. The multi-agent value mixing algorithm must ensure that a global argmax performed on the joint action-value function yields the same result as the set of individual argmax operations performed on each agent's action-value function, i.e.
where Q_h denotes the individual value function of the h-th agent and Q_tot denotes the system value function of the mixing network.
To guarantee the above property, the parameters of the mixed Q network take the global state as input and constrain the output mixing weights to be non-negative. By using the multi-agent value mixing algorithm to fit the relationship between the individual value functions and the system value function, the contribution of each agent to the system is effectively assigned, and accurate update signals are thereby obtained to achieve coordination of the agent system.
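As an illustration of the mixing network just described, the sketch below shows a QMIX-style monotonic mixer in PyTorch: per-agent Q values are combined using weights generated from the global state, and taking the absolute value of those weights keeps them non-negative, which enforces ∂Q_tot/∂Q_h ≥ 0 so that the joint greedy action stays consistent with the per-agent greedy actions. The layer sizes and the class name MonotonicMixer are illustrative assumptions rather than the patent's exact architecture.

import torch
import torch.nn as nn

class MonotonicMixer(nn.Module):
    """Mixes per-agent Q values into Q_tot with state-conditioned non-negative weights."""
    def __init__(self, n_agents, state_dim, embed_dim=32):
        super().__init__()
        self.n_agents = n_agents
        # hypernetworks generate mixing weights/biases from the global state
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed_dim), nn.ReLU(),
                                      nn.Linear(embed_dim, 1))

    def forward(self, agent_qs, state):
        # agent_qs: (batch, n_agents), state: (batch, state_dim)
        b = agent_qs.size(0)
        w1 = torch.abs(self.hyper_w1(state)).view(b, self.n_agents, -1)  # non-negative weights
        b1 = self.hyper_b1(state).view(b, 1, -1)
        hidden = torch.relu(torch.bmm(agent_qs.unsqueeze(1), w1) + b1)   # (b, 1, embed_dim)
        w2 = torch.abs(self.hyper_w2(state)).view(b, -1, 1)              # non-negative weights
        b2 = self.hyper_b2(state).view(b, 1, 1)
        return (torch.bmm(hidden, w2) + b2).squeeze(-1).squeeze(-1)      # Q_tot: (batch,)

# usage: mix the Q values of 4 agents under a 16-dimensional global state
mixer = MonotonicMixer(n_agents=4, state_dim=16)
q_tot = mixer(torch.randn(8, 4), torch.randn(8, 16))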
As shown in Fig. 3, considering that collaborative search environments can be highly complex, some states in the environment exhibit unpredictable, uncontrollable randomness, while a reinforcement learning agent should focus on the part of the state that can actually be affected by the agent's decisions; this part of the state is called the controllable state. A controllable state characterization based on reverse-prediction learning is therefore used. The reverse-prediction task is a learning task that takes two consecutive states x_t and x_{t+1} as input and predicts the action a_t that caused the state change. Considering a state characterization function f(x_t), the controllable state characterization learning task based on reverse prediction can be formalized as:
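One standard way to write this objective, assumed here to be the maximum-likelihood inverse-dynamics formulation commonly used for controllable state representations (the exact form is an assumption based on the surrounding definitions), is:

min over f and h of E_(x_t, a_t, x_{t+1}) [ -log p(a_t | x_t, x_{t+1}) ]

that is, the embedding f and the classifier h are trained jointly so that the action actually taken can be predicted from the embeddings of two consecutive states.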
the above formula learns through experience generated by interaction of an agent with an environment, and has the visual meaning that the state characterization function should learn hidden variables that can help predict actions that actually cause environmental transfer, and ignore factors of the uncontrollable environment, so that policy learning is performed in a hidden space with more compact information. Wherein, the liquid crystal display device comprises a liquid crystal display device,
p(a_t | x_t, x_{t+1}) = h(f(x_t), f(x_{t+1}))
h represents a classifier.
As shown in Fig. 4, the purpose of the aircraft cluster is to search the unknown areas of the environment as efficiently as possible. A natural consequence is that, within the current episode, an agent should not repeatedly search a state that has already been searched. The invention encourages the collaborative search agents to learn this through a novelty reward inherent in the single search process.
Specifically, during interaction with the environment each agent stores representations of the states experienced in the current episode in a segment memory M, and calculates the intrinsic novelty reward of the current state based on the segment memory M. The specific calculation process is as follows:
obtaining the current state x t Characterization f (x) t )。
The n memories whose state representations are closest to f(x_t) in Euclidean distance are selected from the segment memory pool M by a k-nearest-neighbor algorithm and denoted {f_1, ..., f_n}.
The single-search-process novelty intrinsic reward is then calculated over the n nearest neighbors, p ∈ {1, 2, ..., n}, where K is a kernel function that evaluates the distance between two state representations, ε represents a very small constant, d is a distance metric function such as the Euclidean distance, and d̄ denotes the average of the distances to the n nearest neighbors. The resulting single-search-process novelty reward encourages the searching agent to visit states not yet seen in the current interaction, thereby avoiding repeated search of already-searched areas within the same episode and the waste of search effort.
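A concrete form consistent with these definitions, assumed here to follow a Never-Give-Up style episodic novelty bonus (the inverse-square-root aggregation and the exact constants are assumptions), is:

r_t^episodic = 1 / sqrt( Σ_{p=1..n} K(f(x_t), f_p) ),   with   K(f(x_t), f_p) = ε / ( d²(f(x_t), f_p) / d̄² + ε )

so the reward is large when the current state representation lies far from everything already stored in the segment memory and small when it lies close to previously visited states.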
The designed intrinsic exploration reward gives the agent a denser effective signal and thereby accelerates learning. The design of the intrinsic exploration reward considers both the novelty within a single search process and the novelty over the global search process: the global novelty intrinsic reward multiplier is multiplied with the single-search-process novelty intrinsic reward to give the overall intrinsic exploration reward.
The intrinsic novelty reward accounts for novelty-based guidance of the agent within a single interaction; in addition, the global novelty of a state over the entire learning process guides the agent's search learning from a global perspective.
The global novelty reward acts as a multiplier on the single-search-process novelty reward, scaling the reward according to the novelty of the current state over the global training process. The global novelty intrinsic reward multiplier is designed on the basis of the prediction error of a random distillation network:
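A plausible form of this multiplier, assumed here to follow the standard random-network-distillation normalization (the exact expression is an assumption consistent with the definitions below), is:

α_t = 1 + ( err(x_t) - μ_e ) / σ_e,   with   err(x_t) = || ĝ(x_t) - g(x_t) ||²

where g is a fixed, randomly initialized target network and ĝ is the predictor network trained to imitate it.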
where μ_e and σ_e are the running mean and running standard deviation of err(x_t). This indicates that the reward multiplier is positively correlated with the prediction error err(x_t) for state x_t. The random distillation network fixes a randomly initialized target network and uses a predictor network to predict the target network's output; the error of this prediction problem is taken as the exploration reward, and the prediction error indicates how novel the state is to the learning agent over the whole training process.
Combining the single-search-process intrinsic novelty reward with the global novelty intrinsic reward multiplier, the resulting intrinsic reward of the h-th agent at the current moment is r_t^h = r_t^episodic · min{max{α_t, 1}, L}, where L is a predetermined upper bound on the reward multiplier α_t.
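The following Python sketch puts these pieces together for one agent: a kNN kernel over the segment memory gives the episodic novelty reward, an RND-style normalized prediction error gives the global multiplier, and the two are combined with the clipping described above. The constants, array shapes and function names are illustrative assumptions.

import numpy as np

def episodic_novelty(memory, f_t, n=10, eps=1e-3):
    """Single-search-process novelty from the n nearest stored state representations."""
    if not memory:
        return 1.0
    dists = np.linalg.norm(np.asarray(memory) - f_t, axis=1)
    nearest = np.sort(dists)[:n]
    d_mean = nearest.mean() + 1e-8                    # average distance to the n nearest neighbors
    kernel = eps / (nearest**2 / d_mean**2 + eps)     # K(f(x_t), f_p)
    return 1.0 / np.sqrt(kernel.sum() + 1e-8)

def global_multiplier(err_t, err_history):
    """RND-style multiplier: 1 plus the normalized prediction error."""
    mu, sigma = np.mean(err_history), np.std(err_history) + 1e-8
    return 1.0 + (err_t - mu) / sigma

def intrinsic_reward(r_episodic, alpha_t, L=5.0):
    """r_t^h = r_t^episodic * min(max(alpha_t, 1), L); L is an illustrative bound."""
    return r_episodic * min(max(alpha_t, 1.0), L)

# usage with toy embeddings and prediction errors
memory = [np.random.randn(8) for _ in range(50)]
f_t = np.random.randn(8)
r_epi = episodic_novelty(memory, f_t)
alpha = global_multiplier(err_t=0.7, err_history=np.random.rand(100))
print(intrinsic_reward(r_epi, alpha))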
As shown in Fig. 5, two Q networks are trained on the intrinsic rewards and the extrinsic rewards respectively and then combined; fitting the two Q functions separately prevents interference during learning. Using the structure of a universal value function approximator, β_j and γ_j are introduced as inputs that guide the learning of the Q function:
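A decomposition consistent with this parameterized decoupling, assumed here to follow the Agent57-style split into extrinsic and intrinsic value heads (the exact parameterization is an assumption), is:

Q(x, a, j; θ) = Q(x, a, j; θ^e) + β_j · Q(x, a, j; θ^i)

so that each index j selects one member of a family of value functions with exploration weight β_j and discount factor γ_j.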
where Q(x, a, j; θ^e) and Q(x, a, j; θ^i) denote the Q networks learned from the extrinsic rewards and the intrinsic rewards respectively, β_j is the weight of the intrinsic reward Q value, β_j ∈ R+, j ∈ {0, 1, ..., N−1}, γ_j is the discount factor in the Q network learning process, θ is the network parameter of the value function, θ^e is the network parameter of the extrinsic reward Q network, and θ^i is the network parameter of the intrinsic reward Q network. For the learning of the Q function, because the intrinsic reward is difficult to predict, the Markov decision process is implicitly altered, so that the original MDP becomes a partially observable Markov decision process. To address this, the rewards are taken as input information of the Q network; at the same time, the action information and the two types of reward information obtained at the previous step are taken as network inputs, and a representation summarizing all history information (state information, action information and reward information) is maintained inside the agent. By adjusting the weight β_j and the discount factor γ_j in the Q-function learning process, the degree of exploration can be adjusted: a larger β_j means more attention to exploring the current environment, and a larger γ_j means that future returns are weighted more heavily relative to the current return. In the early stage of training β_j can be made larger, paying more attention to the intrinsic reward and performing more search exploration; in the middle and later stages β_j can be reduced appropriately and γ_j increased, paying more attention to the extrinsic reward and encouraging the search to find more enemy units.
To meet the requirement of collaborative search decision-making for aircraft clusters in denied environments, the invention uses multi-agent reinforcement learning to decide the action sequence for multi-agent collaborative search in a zero-communication environment; prior information is used to initialize the agents' initial positions according to the frequency of target occurrence in each cell, reducing the search space during the search and improving search efficiency; a collaborative search algorithm framework based on multi-agent reinforcement learning is proposed, in which a multi-agent collaborative search strategy trained centrally and executed in a distributed manner introduces a controllable-search state characterization technique and an intrinsic exploration reward design, effectively improving the search effectiveness of the aircraft cluster while requiring zero communication traffic for collaborative search; on a ground adversarial wargaming simulation training platform, the effectiveness of the invention was verified by a collaborative search test of a red-side aircraft cluster against a blue-side time-sensitive target cluster. The invention solves problems in the prior art such as low search efficiency, poor search effectiveness and high dependence on communication.
What is not described in detail in the present specification belongs to the known technology of those skilled in the art.

Claims (9)

1. The aircraft cluster collaborative search method with zero interactive communication is characterized by comprising the following steps:
firstly, data on the target position distribution are counted by utilizing information on the matrix-type distribution of targets to form prior information, positions with a high probability of target occurrence are searched preferentially, and the search space is reduced;
secondly, initializing a search experience pool and a search strategy, and establishing a search task interaction environment;
thirdly, constructing an aircraft cluster collaborative search framework based on multi-agent reinforcement learning, and performing collaborative search on a target by utilizing the framework to obtain a search position of each agent at the next moment, and an intrinsic reward and an extrinsic reward at the current moment; each aircraft in the cluster of aircraft is designated as an agent;
fourthly, an intrinsic reward Q network and an extrinsic reward Q network are constructed for each agent, and a mixed Q network is constructed for the aircraft cluster; the intrinsic reward Q network of each agent receives the global state information at the current moment, the search position at the next moment and the credit assigned by the mixed Q network, and through learning and training its output value approaches the intrinsic reward at the current moment; the extrinsic reward Q network of each agent receives the global state information at the current moment, the search position at the next moment and the credit assigned by the mixed Q network, and through learning and training its output value approaches the extrinsic reward at the current moment; the mixed Q network performs credit assignment for the intrinsic reward Q network and the extrinsic reward Q network of each agent, so that the search position of each agent at the next moment has a globally cooperative effect; value functions are obtained using the intrinsic reward Q network and the extrinsic reward Q network and are used to update the aircraft cluster collaborative search framework;
and fifthly, executing target collaborative search by using the aircraft cluster collaborative search framework based on multi-agent reinforcement learning, wherein each agent performs action selection according to its own observation information when executing the target collaborative search.
2. The aircraft cluster collaborative search method for zero-interaction communication according to claim 1, wherein: in the third step, the multi-agent reinforcement learning-based aircraft cluster collaborative search framework comprises a state characterization module, a segment memory pool, a random distillation network and an action network which are configured on each agent.
3. The aircraft cluster collaborative search method for zero-interaction communication according to claim 2, wherein: the aircraft cluster collaborative search framework construction process based on multi-agent reinforcement learning comprises the following steps:
(3.1) in each interaction, each agent obtains observation information from the search task interaction environment and inputs it into the state characterization module; the state characterization module extracts from the agent's observation information the state information that can influence the agent's decision and outputs it to the segment memory pool, the random distillation network and the action network, while removing state information and noise that cannot influence the agent's decision;
(3.2) the action network determines the search position of the next moment according to the state information of the current moment;
(3.3) an intrinsic novelty reward r_t^episodic is calculated through the segment memory pool, and a global novelty intrinsic reward multiplier is calculated through the random distillation network, finally obtaining the intrinsic reward at the current moment;
(3.4) the search task interaction environment gives an agent an extrinsic reward when that agent finds a target.
4. The aircraft cluster collaborative search method for zero-interaction communication according to claim 3, wherein: in the step (3.3), the intrinsic novelty reward r_t^episodic is calculated through the segment memory pool as follows:
at the current moment, the state information extracted by the state characterization module from the agent's observation information is denoted f(x_t);
the n memories whose state representations are closest to f(x_t) in Euclidean distance are selected from the segment memory pool by a k-nearest-neighbor algorithm and denoted {f_1, ..., f_n};
a novelty reward inherent in the single search process is then calculated, where K is a kernel function that evaluates the distance between two state representations.
5. The aircraft cluster collaborative search method for zero-interaction communication according to claim 4, wherein:
ε represents a constant, d is the Euclidean distance metric function, and d̄ represents the average of the distances to the n nearest neighbors.
6. The aircraft cluster collaborative search method for zero-interaction communication according to claim 3, wherein: in the step (3.3), the global novelty intrinsic reward multiplier α_t is calculated through the random distillation network, where err(x_t) denotes the prediction error for the observation information x_t, and μ_e and σ_e are the running mean and running standard deviation of err(x_t).
7. The aircraft cluster collaborative search method for zero-interaction communication according to claim 6, wherein: the intrinsic reward of the h-th agent at the current moment is r_t^h = r_t^episodic · min{max{α_t, 1}, L}, where L is a predetermined upper bound on the reward multiplier α_t.
8. The aircraft cluster collaborative search method for zero-interaction communication according to claim 1, wherein: for each agent, the value functions of the intrinsic reward Q network and the extrinsic reward Q network are calculated as follows, with both networks learning by means of a universal value function approximator:
where Q^e and Q^i denote the extrinsic reward Q network and the intrinsic reward Q network respectively, β_j is the weight of the intrinsic reward Q value, β_j ∈ R+, j ∈ {0, 1, ..., N−1}; N is the number of agents, γ_j is the discount factor in the Q network learning process, θ is the network parameter of the value function, θ^e is the network parameter of the extrinsic reward Q network, and θ^i is the network parameter of the intrinsic reward Q network.
9. The aircraft cluster collaborative search method for zero-interaction communication according to claim 1, wherein: the update strategy of the hybrid Q network is as follows, where Q_h denotes the individual value function of the h-th agent and Q_tot denotes the system value function of the hybrid network.
CN202111532038.XA 2021-12-14 2021-12-14 Aircraft cluster collaborative search method with zero interaction communication Active CN114281103B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111532038.XA CN114281103B (en) 2021-12-14 2021-12-14 Aircraft cluster collaborative search method with zero interaction communication

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111532038.XA CN114281103B (en) 2021-12-14 2021-12-14 Aircraft cluster collaborative search method with zero interaction communication

Publications (2)

Publication Number Publication Date
CN114281103A CN114281103A (en) 2022-04-05
CN114281103B true CN114281103B (en) 2023-09-29

Family

ID=80872321

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111532038.XA Active CN114281103B (en) 2021-12-14 2021-12-14 Aircraft cluster collaborative search method with zero interaction communication

Country Status (1)

Country Link
CN (1) CN114281103B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114690623B (en) * 2022-04-21 2022-10-25 中国人民解放军军事科学院战略评估咨询中心 Intelligent agent efficient global exploration method and system for rapid convergence of value function
CN116974729B (en) * 2023-09-22 2024-02-09 浪潮(北京)电子信息产业有限公司 Task scheduling method and device for big data job, electronic equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113495578A (en) * 2021-09-07 2021-10-12 南京航空航天大学 Digital twin training-based cluster track planning reinforcement learning method
CN113592162A (en) * 2021-07-22 2021-11-02 西北工业大学 Multi-agent reinforcement learning-based multi-underwater unmanned aircraft collaborative search method

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3948671A1 (en) * 2019-05-23 2022-02-09 DeepMind Technologies Limited Jointly learning exploratory and non-exploratory action selection policies
US11703853B2 (en) * 2019-12-03 2023-07-18 University-Industry Cooperation Group Of Kyung Hee University Multiple unmanned aerial vehicles navigation optimization method and multiple unmanned aerial vehicles system using the same

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113592162A (en) * 2021-07-22 2021-11-02 西北工业大学 Multi-agent reinforcement learning-based multi-underwater unmanned aircraft collaborative search method
CN113495578A (en) * 2021-09-07 2021-10-12 南京航空航天大学 Digital twin training-based cluster track planning reinforcement learning method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
基于深度强化学习的弹道导弹中段突防控制 (Midcourse penetration control of ballistic missiles based on deep reinforcement learning); 南英; 蒋亮; 指挥信息系统与技术 (Command Information System and Technology), No. 04; full text *

Also Published As

Publication number Publication date
CN114281103A (en) 2022-04-05

Similar Documents

Publication Publication Date Title
Jiang et al. Deep-learning-based joint resource scheduling algorithms for hybrid MEC networks
CN114281103B (en) Aircraft cluster collaborative search method with zero interaction communication
CN114415735B (en) Dynamic environment-oriented multi-unmanned aerial vehicle distributed intelligent task allocation method
CN114690623B (en) Intelligent agent efficient global exploration method and system for rapid convergence of value function
CN111814066B (en) Dynamic social user alignment method and system based on heuristic algorithm
CN111832911A (en) Underwater combat effectiveness evaluation method based on neural network algorithm
Fan et al. Cb-dsl: Communication-efficient and byzantine-robust distributed swarm learning on non-iid data
CN114219066A (en) Unsupervised reinforcement learning method and unsupervised reinforcement learning device based on Watherstein distance
CN110851911A (en) Terminal state calculation model training method, control sequence searching method and device
CN115630566B (en) Data assimilation method and system based on deep learning and dynamic constraint
Liu et al. Efficient adversarial attacks on online multi-agent reinforcement learning
Sun et al. Learning controlled and targeted communication with the centralized critic for the multi-agent system
CN115730743A (en) Battlefield combat trend prediction method based on deep neural network
CN113300884B (en) GWO-SVR-based step-by-step network flow prediction method
CN114757092A (en) System and method for training multi-agent cooperative communication strategy based on teammate perception
WO2022076061A1 (en) Interactive agent
Samanta et al. Energy management in hybrid electric vehicles using optimized radial basis function neural network
CN116070714B (en) Cloud edge cooperative training method and system based on federal learning and neural architecture search
CN112991384B (en) DDPG-based intelligent cognitive management method for emission resources
US20220114474A1 (en) Interactive agent
Sopov Self-configuring Multi-strategy Genetic Algorithm for Non-stationary Environments
Zhang et al. Multi-agent feature learning and integration for mixed cooperative and competitive environment
CN117590757B (en) Multi-unmanned aerial vehicle cooperative task allocation method based on Gaussian distribution sea-gull optimization algorithm
Zhao et al. Deep Reinforcement Learning‐Based Air Defense Decision‐Making Using Potential Games
Hoang et al. SubIQ: Inverse Soft-Q Learning for Offline Imitation with Suboptimal Demonstrations

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant