CN114281103B - Aircraft cluster collaborative search method with zero interaction communication - Google Patents

Aircraft cluster collaborative search method with zero interaction communication

Info

Publication number
CN114281103B
CN114281103B (application CN202111532038.XA)
Authority
CN
China
Prior art keywords
network
agent
search
intrinsic
rewards
Prior art date
Legal status
Active
Application number
CN202111532038.XA
Other languages
Chinese (zh)
Other versions
CN114281103A (en)
Inventor
惠俊鹏
范佳宣
张旭辉
路鹰
陈海鹏
李博遥
黄虎
王振亚
李君�
郑本昌
阎岩
李丝然
何昳頔
张佳
任金磊
吴志壕
刘峰
范中行
王鹏
吴海华
程炳琳
周辉
韩特
王颖昕
刘洋
孟元军
Current Assignee
China Academy of Launch Vehicle Technology CALT
Original Assignee
China Academy of Launch Vehicle Technology CALT
Priority date
Filing date
Publication date
Application filed by China Academy of Launch Vehicle Technology CALT filed Critical China Academy of Launch Vehicle Technology CALT
Priority to CN202111532038.XA priority Critical patent/CN114281103B/en
Publication of CN114281103A publication Critical patent/CN114281103A/en
Application granted granted Critical
Publication of CN114281103B publication Critical patent/CN114281103B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Abstract

The aircraft cluster collaborative search method with zero interactive communication uses information on the matrix-type distribution of targets to compile statistics on target position data and form prior information; a search experience pool and a search strategy are initialized, and a search task interaction environment is established; an aircraft cluster collaborative search framework based on multi-agent reinforcement learning is constructed, in which each agent obtains observation information from the environment; an intrinsic reward Q network and an extrinsic reward Q network are built for each agent, a mixed Q network is built for the aircraft cluster, and learning and training are performed; during execution, each agent selects actions according to its local observations, satisfying the practical constraints of most real environments. The invention solves problems in the prior art such as sparse rewards making it difficult for agents to obtain reward signals in large search spaces, slow learning, and dependence of the search process on communication and global information.

Description

Aircraft cluster collaborative search method with zero interaction communication
Technical Field
The invention relates to the field of collaborative decision-making of aircraft clusters, in particular to an aircraft cluster collaborative search method with zero interactive communication.
Background
The United States, exploiting its advantage in electronic countermeasure capability, can use integrated land, sea and air electronic countermeasure means to construct a combat environment that denies the adversary's communications in target areas such as aircraft carriers and bases. Traditional cooperative approaches lose their coordination capability once communication is suppressed, and the combat capability of the units involved is greatly reduced, so there is an urgent need to develop new technical means for effective cooperation in communication-denied environments.
Some progress has been made on collaborative unmanned clusters that operate without communication. DARPA's Collaborative Operations in Denied Environments (CODE) project reduces the dependence on communication through comprehensive means such as algorithms, software and system architecture, enhancing the combat capability of unmanned aerial vehicles or missiles in denied environments.
Most current cluster collaborative search algorithms adopt methods based on communication negotiation or partition traversal. For the dynamic, adaptive collaborative decision problem posed by time-sensitive targets, they suffer from low search efficiency, poor search effectiveness and high dependence on communication.
Disclosure of Invention
The technical problem solved by the invention is as follows: an aircraft cluster collaborative search method with zero-interaction communication is provided, which overcomes the problems of low search efficiency, poor search effectiveness and high dependence on communication in the prior art.
The technical scheme of the invention is as follows:
an aircraft cluster collaborative search method with zero interaction communication, comprising the following steps:
firstly, data on the target position distribution are counted by utilizing information on the matrix-type distribution of targets to form prior information, positions with a high probability of target occurrence are searched preferentially, and the search space is reduced;
secondly, initializing a search experience pool and a search strategy, and establishing a search task interaction environment;
thirdly, constructing an aircraft cluster collaborative search framework based on multi-agent reinforcement learning, and performing collaborative search on a target by utilizing the framework to obtain a search position of each agent at the next moment, and an intrinsic reward and an extrinsic reward at the current moment; each aircraft in the cluster of aircraft is designated as an agent;
fourthly, an intrinsic reward Q network and an extrinsic reward Q network are constructed for each agent, and a mixed Q network is constructed for the aircraft cluster; the intrinsic reward Q network of each agent receives the global state information at the current moment, the search position at the next moment and the credit assigned by the mixed Q network, and through learning and training its output value approaches the intrinsic reward at the current moment; the extrinsic reward Q network of each agent receives the global state information at the current moment, the search position at the next moment and the credit assigned by the mixed Q network, and through learning and training its output value approaches the extrinsic reward at the current moment; the mixed Q network performs credit assignment for the intrinsic reward Q network and the extrinsic reward Q network of each agent, so that the search position of each agent at the next moment has a globally cooperative effect; value functions are obtained using the intrinsic reward Q network and the extrinsic reward Q network and are used to update the aircraft cluster collaborative search framework;
and fifthly, executing target collaborative search by using the aircraft cluster collaborative search framework based on multi-agent reinforcement learning, wherein each agent performs action selection according to its own observation information when executing the target collaborative search.
In the third step, the multi-agent reinforcement learning-based aircraft cluster collaborative search framework comprises a state characterization module, a segment memory pool, a random distillation network and an action network which are configured on each agent.
The aircraft cluster collaborative search framework construction process based on multi-agent reinforcement learning comprises the following steps:
(3.1) in each interaction, each agent obtains observation information from the search task interaction environment and inputs it into the state characterization module; the state characterization module extracts from the agent's observation information the state information that can influence the agent's decision and outputs it to the segment memory pool, the random distillation network and the action network, while removing state information and noise that cannot influence the agent's decision;
(3.2) the action network determines the search position of the next moment according to the state information of the current moment;
(3.3) an intrinsic novelty reward r_t^episodic is calculated through the segment memory pool, and a global novelty intrinsic reward multiplier is calculated through the random distillation network, finally obtaining the intrinsic reward at the current moment;
(3.4) the search task interaction environment gives an agent an extrinsic reward when that agent finds a target.
In the step (3.3), the intrinsic novelty reward r_t^episodic is calculated through the segment memory pool as follows:
at the current moment, the state information extracted by the state characterization module from the agent's observation information is denoted f(x_t);
the n memories whose state representations are closest to f(x_t) in Euclidean distance are selected from the segment memory pool by a k-nearest-neighbor algorithm and denoted {f_1, ..., f_n};
a novelty reward inherent in the single search process is then calculated, where K is a kernel function that evaluates the distance between two state representations, ε represents a constant, d is the Euclidean distance metric function, and d̄ represents the average of the distances to the n nearest neighbors.
In the step (3.3), the global novelty intrinsic reward multiplier α_t is calculated through the random distillation network, where err(x_t) denotes the prediction error for the observation information x_t, and μ_e and σ_e are the running mean and running standard deviation of err(x_t).
The intrinsic reward of the h-th agent at the current moment is r_t^h = r_t^episodic · min{max{α_t, 1}, L}, where L is a predetermined upper bound on the reward multiplier α_t.
For each agent, the value functions of the intrinsic reward Q network and the extrinsic reward Q network are calculated as follows, with both networks learning by means of a universal value function approximator:
where Q^e and Q^i denote the extrinsic reward Q network and the intrinsic reward Q network respectively, β_j is the weight of the intrinsic reward Q value, β_j ∈ R+, j ∈ {0, 1, ..., N−1}; N is the number of agents, γ_j is the discount factor in the Q network learning process, θ is the network parameter of the value function, θ^e is the network parameter of the extrinsic reward Q network, and θ^i is the network parameter of the intrinsic reward Q network.
The update strategy of the hybrid Q network is as follows, where Q_h denotes the individual value function of the h-th agent and Q_tot denotes the system value function of the hybrid network.
Compared with the prior art, the invention has the following beneficial effects:
1. the invention adopts a collaborative search algorithm with centralized training and distributed deployment, which can achieve the performance of communication-based collaborative search algorithms without requiring communication, reducing the dependence on communication;
2. the invention combines intrinsic exploration rewards with extrinsic rewards, introducing dense, effective rewards into the sparse-reward collaborative search task, which accelerates learning and improves search efficiency and search effectiveness.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a schematic diagram of a QMIX-based algorithm architecture of the present invention;
FIG. 3 is a probabilistic approximation network of the present invention based on a controllable state representation network;
FIG. 4 is a diagram of the intrinsic reward design of the present invention;
fig. 5 is a schematic diagram of a Q network architecture based on parameterized decoupling according to the present invention.
Detailed Description
The invention provides an aircraft cluster collaborative search method with zero-interaction communication, which solves problems in the prior art such as sparse rewards making it difficult for agents to obtain reward signals in large search spaces, slow learning, and dependence of the search process on communication and global information.
As shown in fig. 1, the steps include:
In the first step, statistics on the target position distribution are compiled by utilizing information on the matrix-type distribution of targets to form prior information, positions with a high probability of target occurrence are searched preferentially, and the search space is reduced.
The construction and utilization process of the prior information is as follows:
(1.1) counting the positions of all targets in the offline data;
(1.2) obtaining a heat map of the target distribution range from the offline data of the target position distribution;
(1.3) initializing the initial positions of the agents according to the frequency of target occurrence in each cell, and giving a small reward of 0.1 when an agent visits an area where a target may occur.
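As an illustration of steps (1.1) to (1.3), the following Python sketch builds a target-occurrence heat map from offline target positions and uses it to initialize agent positions and to compute the small 0.1 shaping reward. The grid representation, the data format and the helper names (build_prior_heatmap, sample_initial_positions, prior_bonus) are illustrative assumptions, not the patent's implementation.

import numpy as np

def build_prior_heatmap(target_positions, grid_shape):
    """Count historical target positions per grid cell and normalize to a probability map."""
    heat = np.zeros(grid_shape, dtype=np.float64)
    for row, col in target_positions:          # offline data: (row, col) grid indices
        heat[row, col] += 1.0
    total = heat.sum()
    return heat / total if total > 0 else np.full(grid_shape, 1.0 / heat.size)

def sample_initial_positions(heatmap, num_agents, rng):
    """Place agents preferentially in cells where targets appeared frequently."""
    flat = heatmap.ravel()
    idx = rng.choice(flat.size, size=num_agents, p=flat)
    return [np.unravel_index(i, heatmap.shape) for i in idx]

def prior_bonus(heatmap, cell, threshold=0.0):
    """Small shaping reward (0.1) for visiting a cell where targets may occur."""
    return 0.1 if heatmap[cell] > threshold else 0.0

# usage example with synthetic offline data
rng = np.random.default_rng(0)
offline_targets = [(2, 3), (2, 4), (7, 7), (2, 3)]
heat = build_prior_heatmap(offline_targets, grid_shape=(10, 10))
print(sample_initial_positions(heat, num_agents=3, rng=rng))
print(prior_bonus(heat, (2, 3)))   # -> 0.1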
In the second step, the search experience pool and the search strategy are initialized, and the search task interaction environment is established.
In the third step, an aircraft cluster collaborative search framework based on multi-agent reinforcement learning is constructed, in which the agents acquire observation information from the environment, the multi-agent system is encouraged to explore the unknown environment, and search actions are sent to the search environment.
The aircraft cluster collaborative search framework based on multi-agent reinforcement learning comprises a state characterization module, a segment memory pool, a random distillation network and an action network which are configured on each agent.
The construction process of the aircraft cluster collaborative search framework based on multi-agent reinforcement learning is as follows:
(3.1) in each interaction, each agent obtains observation information from the search task interaction environment and inputs it into the state characterization module; the state characterization module extracts from the agent's observation information the state information that can influence the agent's decision and outputs it to the segment memory pool, the random distillation network and the action network, while removing state information and noise that cannot influence the agent's decision;
(3.2) the action network determines the search position of the next moment according to the state information of the current moment;
(3.3) an intrinsic novelty reward r_t^episodic is calculated through the segment memory pool, and a global novelty intrinsic reward multiplier is calculated through the random distillation network, finally obtaining the intrinsic reward at the current moment;
(3.4) the search task interaction environment gives an agent an extrinsic reward when that agent finds a target.
In the fourth step, two Q networks are constructed for each agent to learn from different reward signals. During training, the global information available in the training process is fully utilized, and the observations and strategies of the other agents are used as additional state inputs, so that the value function estimate of the current agent is considered explicitly.
The detailed method of constructing two Q networks that learn from their respective reward signals is as follows: the controllable hidden variables output by the controllable-search state characterization module are combined with the environmental (extrinsic) rewards and with the intrinsic rewards calculated by the intrinsic exploration reward module, respectively, and are input into two Q networks based on the universal value function approximator structure; the two Q networks are trained separately to prevent mutual interference between the environmental rewards and the intrinsic rewards.
A mixed Q network is constructed for the aircraft cluster; the intrinsic reward Q network of each agent receives the global state information at the current moment, the search position at the next moment and the credit assigned by the mixed Q network, and through learning and training its output value approaches the intrinsic reward at the current moment; the extrinsic reward Q network of each agent receives the global state information at the current moment, the search position at the next moment and the credit assigned by the mixed Q network, and through learning and training its output value approaches the extrinsic reward at the current moment; the mixed Q network performs credit assignment for the intrinsic reward Q network and the extrinsic reward Q network of each agent, so that the search position of each agent at the next moment has a globally cooperative effect; value functions are obtained using the intrinsic reward Q network and the extrinsic reward Q network and are used to update the aircraft cluster collaborative search framework.
In the fifth step, target collaborative search is executed using the aircraft cluster collaborative search framework based on multi-agent reinforcement learning; during execution, each agent selects actions according to its local observations, satisfying the practical constraints of most real environments.
The training and deployment method of the aircraft cluster collaborative search model with zero-interaction communication is as follows: with the objective of searching for the maximum number of targets in the shortest possible time, learning-based decisions are made over the search area.
As shown in Fig. 2, in order to solve the problem that a single-agent algorithm repeatedly makes decisions in the same state, a multi-agent algorithm is introduced to perform reward distribution among multiple agents. In a centralized-training, independent-learning scheme based on parameter sharing, each agent learns with a multi-agent value mixing algorithm. The detailed method is as follows:
in the training process, global information which can be obtained in the training process is fully utilized, and the observation and strategy of other intelligent agents are used as additional state input, so that the current intelligent agent value function estimation is explicitly considered. Under the condition that the observation and the strategy of other agents are unchanged, each agent is expected to adjust the strategy of the agent and make the best response in the current state so as to maximize the overall benefit of the system. When the method is executed, the intelligent agent performs action selection according to local observation, and the real constraint of a zero communication environment is met. In order to better facilitate accurate assessment, action information performed by other agents is acquired accordingly during the environmental risk assessment and agent training process. On the basis, researches are conducted aiming at the reliability distribution problem facing contribution analysis, and a framework of a multi-agent value mixing algorithm is adopted to distribute a value function of the system.
The mixed Q network adopts a Q network based on parameterized decoupling. It receives the action values output by the Q network of each agent and extracts information from the global state; after processing by the controllable-search state characterization module, this information is combined with the extrinsic rewards and with the intrinsic exploration rewards output by the intrinsic exploration reward module, respectively, to update the action value of the whole system, and signals are then returned to the Q network of each agent to update that agent's strategy. The multi-agent value mixing algorithm must ensure that a global argmax performed on the joint action-value function yields the same result as the set of individual argmax operations performed on each agent's action-value function, i.e.
where Q_h denotes the individual value function of the h-th agent and Q_tot denotes the system value function of the mixing network.
To guarantee the above property, the parameters of the mixed Q network take the global state as input and constrain the output mixing weights to be non-negative. By using the multi-agent value mixing algorithm to fit the relationship between the individual value functions and the system value function, the contribution of each agent to the system is effectively assigned, and accurate update signals are thereby obtained to achieve coordination of the agent system.
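As an illustration of the mixing network just described, the sketch below shows a QMIX-style monotonic mixer in PyTorch: per-agent Q values are combined using weights generated from the global state, and taking the absolute value of those weights keeps them non-negative, which enforces ∂Q_tot/∂Q_h ≥ 0 so that the joint greedy action stays consistent with the per-agent greedy actions. The layer sizes and the class name MonotonicMixer are illustrative assumptions rather than the patent's exact architecture.

import torch
import torch.nn as nn

class MonotonicMixer(nn.Module):
    """Mixes per-agent Q values into Q_tot with state-conditioned non-negative weights."""
    def __init__(self, n_agents, state_dim, embed_dim=32):
        super().__init__()
        self.n_agents = n_agents
        # hypernetworks generate mixing weights/biases from the global state
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed_dim), nn.ReLU(),
                                      nn.Linear(embed_dim, 1))

    def forward(self, agent_qs, state):
        # agent_qs: (batch, n_agents), state: (batch, state_dim)
        b = agent_qs.size(0)
        w1 = torch.abs(self.hyper_w1(state)).view(b, self.n_agents, -1)  # non-negative weights
        b1 = self.hyper_b1(state).view(b, 1, -1)
        hidden = torch.relu(torch.bmm(agent_qs.unsqueeze(1), w1) + b1)   # (b, 1, embed_dim)
        w2 = torch.abs(self.hyper_w2(state)).view(b, -1, 1)              # non-negative weights
        b2 = self.hyper_b2(state).view(b, 1, 1)
        return (torch.bmm(hidden, w2) + b2).squeeze(-1).squeeze(-1)      # Q_tot: (batch,)

# usage: mix the Q values of 4 agents under a 16-dimensional global state
mixer = MonotonicMixer(n_agents=4, state_dim=16)
q_tot = mixer(torch.randn(8, 4), torch.randn(8, 16))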
As shown in Fig. 3, considering that collaborative search environments can be highly complex, some states in the environment exhibit unpredictable, uncontrollable randomness, while a reinforcement learning agent should focus on the part of the state that can actually be affected by the agent's decisions; this part of the state is called the controllable state. A controllable state characterization based on reverse-prediction learning is therefore used. The reverse-prediction task is a learning task that takes two consecutive states x_t and x_{t+1} as input and predicts the action a_t that caused the state change. Considering a state characterization function f(x_t), the controllable state characterization learning task based on reverse prediction can be formalized as:
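One standard way to write this objective, assumed here to be the maximum-likelihood inverse-dynamics formulation commonly used for controllable state representations (the exact form is an assumption based on the surrounding definitions), is:

min over f and h of E_(x_t, a_t, x_{t+1}) [ -log p(a_t | x_t, x_{t+1}) ]

that is, the embedding f and the classifier h are trained jointly so that the action actually taken can be predicted from the embeddings of two consecutive states.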
the above formula learns through experience generated by interaction of an agent with an environment, and has the visual meaning that the state characterization function should learn hidden variables that can help predict actions that actually cause environmental transfer, and ignore factors of the uncontrollable environment, so that policy learning is performed in a hidden space with more compact information. Wherein, the liquid crystal display device comprises a liquid crystal display device,
p(a_t | x_t, x_{t+1}) = h(f(x_t), f(x_{t+1}))
h represents a classifier.
As shown in Fig. 4, the purpose of the aircraft cluster is to search the unknown areas of the environment as efficiently as possible. A natural consequence is that, within the current episode, an agent should not repeatedly search a state that has already been searched. The invention encourages the collaborative search agents to learn this through a novelty reward inherent in the single search process.
Specifically, during interaction with the environment each agent stores representations of the states experienced in the current episode in a segment memory M, and calculates the intrinsic novelty reward of the current state based on the segment memory M. The specific calculation process is as follows:
obtaining the current state x t Characterization f (x) t )。
The n memories whose state representations are closest to f(x_t) in Euclidean distance are selected from the segment memory pool M by a k-nearest-neighbor algorithm and denoted {f_1, ..., f_n}.
The single-search-process novelty intrinsic reward is then calculated over the n nearest neighbors, p ∈ {1, 2, ..., n}, where K is a kernel function that evaluates the distance between two state representations, ε represents a very small constant, d is a distance metric function such as the Euclidean distance, and d̄ denotes the average of the distances to the n nearest neighbors. The resulting single-search-process novelty reward encourages the searching agent to visit states not yet seen in the current interaction, thereby avoiding repeated search of already-searched areas within the same episode and the waste of search effort.
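A concrete form consistent with these definitions, assumed here to follow a Never-Give-Up style episodic novelty bonus (the inverse-square-root aggregation and the exact constants are assumptions), is:

r_t^episodic = 1 / sqrt( Σ_{p=1..n} K(f(x_t), f_p) ),   with   K(f(x_t), f_p) = ε / ( d²(f(x_t), f_p) / d̄² + ε )

so the reward is large when the current state representation lies far from everything already stored in the segment memory and small when it lies close to previously visited states.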
The designed intrinsic exploration reward gives the agent a denser effective signal and thereby accelerates learning. The design of the intrinsic exploration reward considers both the novelty within a single search process and the novelty over the global search process: the global novelty intrinsic reward multiplier is multiplied with the single-search-process novelty intrinsic reward to give the overall intrinsic exploration reward.
The intrinsic novelty reward accounts for novelty-based guidance of the agent within a single interaction; in addition, the global novelty of a state over the entire learning process guides the agent's search learning from a global perspective.
The global novelty reward acts as a multiplier on the single-search-process novelty reward, scaling the reward according to the novelty of the current state over the global training process. The global novelty intrinsic reward multiplier is designed on the basis of the prediction error of a random distillation network:
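A plausible form of this multiplier, assumed here to follow the standard random-network-distillation normalization (the exact expression is an assumption consistent with the definitions below), is:

α_t = 1 + ( err(x_t) - μ_e ) / σ_e,   with   err(x_t) = || ĝ(x_t) - g(x_t) ||²

where g is a fixed, randomly initialized target network and ĝ is the predictor network trained to imitate it.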
where μ_e and σ_e are the running mean and running standard deviation of err(x_t). This indicates that the reward multiplier is positively correlated with the prediction error err(x_t) for state x_t. The random distillation network fixes a randomly initialized target network and uses a predictor network to predict the target network's output; the error of this prediction problem is taken as the exploration reward, and the prediction error indicates how novel the state is to the learning agent over the whole training process.
Combining the single-search-process intrinsic novelty reward with the global novelty intrinsic reward multiplier, the resulting intrinsic reward of the h-th agent at the current moment is r_t^h = r_t^episodic · min{max{α_t, 1}, L}, where L is a predetermined upper bound on the reward multiplier α_t.
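The following Python sketch puts these pieces together for one agent: a kNN kernel over the segment memory gives the episodic novelty reward, an RND-style normalized prediction error gives the global multiplier, and the two are combined with the clipping described above. The constants, array shapes and function names are illustrative assumptions.

import numpy as np

def episodic_novelty(memory, f_t, n=10, eps=1e-3):
    """Single-search-process novelty from the n nearest stored state representations."""
    if not memory:
        return 1.0
    dists = np.linalg.norm(np.asarray(memory) - f_t, axis=1)
    nearest = np.sort(dists)[:n]
    d_mean = nearest.mean() + 1e-8                    # average distance to the n nearest neighbors
    kernel = eps / (nearest**2 / d_mean**2 + eps)     # K(f(x_t), f_p)
    return 1.0 / np.sqrt(kernel.sum() + 1e-8)

def global_multiplier(err_t, err_history):
    """RND-style multiplier: 1 plus the normalized prediction error."""
    mu, sigma = np.mean(err_history), np.std(err_history) + 1e-8
    return 1.0 + (err_t - mu) / sigma

def intrinsic_reward(r_episodic, alpha_t, L=5.0):
    """r_t^h = r_t^episodic * min(max(alpha_t, 1), L); L is an illustrative bound."""
    return r_episodic * min(max(alpha_t, 1.0), L)

# usage with toy embeddings and prediction errors
memory = [np.random.randn(8) for _ in range(50)]
f_t = np.random.randn(8)
r_epi = episodic_novelty(memory, f_t)
alpha = global_multiplier(err_t=0.7, err_history=np.random.rand(100))
print(intrinsic_reward(r_epi, alpha))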
As shown in Fig. 5, two Q networks are trained on the intrinsic rewards and the extrinsic rewards respectively and then combined; fitting the two Q functions separately prevents interference during learning. Using the structure of a universal value function approximator, β_j and γ_j are introduced as inputs that guide the learning of the Q function:
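A decomposition consistent with this parameterized decoupling, assumed here to follow the Agent57-style split into extrinsic and intrinsic value heads (the exact parameterization is an assumption), is:

Q(x, a, j; θ) = Q(x, a, j; θ^e) + β_j · Q(x, a, j; θ^i)

so that each index j selects one member of a family of value functions with exploration weight β_j and discount factor γ_j.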
where Q(x, a, j; θ^e) and Q(x, a, j; θ^i) denote the Q networks learned from the extrinsic rewards and the intrinsic rewards respectively, β_j is the weight of the intrinsic reward Q value, β_j ∈ R+, j ∈ {0, 1, ..., N−1}, γ_j is the discount factor in the Q network learning process, θ is the network parameter of the value function, θ^e is the network parameter of the extrinsic reward Q network, and θ^i is the network parameter of the intrinsic reward Q network. For the learning of the Q function, because the intrinsic reward is difficult to predict, the Markov decision process is implicitly altered, so that the original MDP becomes a partially observable Markov decision process. To address this, the rewards are taken as input information of the Q network; at the same time, the action information and the two types of reward information obtained at the previous step are taken as network inputs, and a representation summarizing all history information (state information, action information and reward information) is maintained inside the agent. By adjusting the weight β_j and the discount factor γ_j in the Q-function learning process, the degree of exploration can be adjusted: a larger β_j means more attention to exploring the current environment, and a larger γ_j means that future returns are weighted more heavily relative to the current return. In the early stage of training β_j can be made larger, paying more attention to the intrinsic reward and performing more search exploration; in the middle and later stages β_j can be reduced appropriately and γ_j increased, paying more attention to the extrinsic reward and encouraging the search to find more enemy units.
To meet the requirement of collaborative search decision-making for aircraft clusters in denied environments, the invention uses multi-agent reinforcement learning to decide the action sequence for multi-agent collaborative search in a zero-communication environment; prior information is used to initialize the agents' initial positions according to the frequency of target occurrence in each cell, reducing the search space during the search and improving search efficiency; a collaborative search algorithm framework based on multi-agent reinforcement learning is proposed, in which a multi-agent collaborative search strategy trained centrally and executed in a distributed manner introduces a controllable-search state characterization technique and an intrinsic exploration reward design, effectively improving the search effectiveness of the aircraft cluster while requiring zero communication traffic for collaborative search; on a ground adversarial wargaming simulation training platform, the effectiveness of the invention was verified by a collaborative search test of a red-side aircraft cluster against a blue-side time-sensitive target cluster. The invention solves problems in the prior art such as low search efficiency, poor search effectiveness and high dependence on communication.
What is not described in detail in the present specification belongs to the known technology of those skilled in the art.

Claims (9)

1. The aircraft cluster collaborative search method with zero interactive communication is characterized by comprising the following steps:
firstly, data on the target position distribution are counted by utilizing information on the matrix-type distribution of targets to form prior information, positions with a high probability of target occurrence are searched preferentially, and the search space is reduced;
secondly, initializing a search experience pool and a search strategy, and establishing a search task interaction environment;
thirdly, constructing an aircraft cluster collaborative search framework based on multi-agent reinforcement learning, and performing collaborative search on a target by utilizing the framework to obtain a search position of each agent at the next moment, and an intrinsic reward and an extrinsic reward at the current moment; each aircraft in the cluster of aircraft is designated as an agent;
fourthly, an intrinsic reward Q network and an extrinsic reward Q network are constructed for each agent, and a mixed Q network is constructed for the aircraft cluster; the intrinsic reward Q network of each agent receives the global state information at the current moment, the search position at the next moment and the credit assigned by the mixed Q network, and through learning and training its output value approaches the intrinsic reward at the current moment; the extrinsic reward Q network of each agent receives the global state information at the current moment, the search position at the next moment and the credit assigned by the mixed Q network, and through learning and training its output value approaches the extrinsic reward at the current moment; the mixed Q network performs credit assignment for the intrinsic reward Q network and the extrinsic reward Q network of each agent, so that the search position of each agent at the next moment has a globally cooperative effect; value functions are obtained using the intrinsic reward Q network and the extrinsic reward Q network and are used to update the aircraft cluster collaborative search framework;
and fifthly, executing target collaborative search by using the aircraft cluster collaborative search framework based on multi-agent reinforcement learning, wherein each agent performs action selection according to its own observation information when executing the target collaborative search.
2. The aircraft cluster collaborative search method for zero-interaction communication according to claim 1, wherein: in the third step, the multi-agent reinforcement learning-based aircraft cluster collaborative search framework comprises a state characterization module, a segment memory pool, a random distillation network and an action network which are configured on each agent.
3. The aircraft cluster collaborative search method for zero-interaction communication according to claim 2, wherein: the aircraft cluster collaborative search framework construction process based on multi-agent reinforcement learning comprises the following steps:
(3.1) in each interaction, each agent obtains observation information from the search task interaction environment and inputs it into the state characterization module; the state characterization module extracts from the agent's observation information the state information that can influence the agent's decision and outputs it to the segment memory pool, the random distillation network and the action network, while removing state information and noise that cannot influence the agent's decision;
(3.2) the action network determines the search position of the next moment according to the state information of the current moment;
(3.3) an intrinsic novelty reward r_t^episodic is calculated through the segment memory pool, and a global novelty intrinsic reward multiplier is calculated through the random distillation network, finally obtaining the intrinsic reward at the current moment;
(3.4) the search task interaction environment gives an agent an extrinsic reward when that agent finds a target.
4. The aircraft cluster collaborative search method for zero-interaction communication according to claim 3, wherein: in the step (3.3), the intrinsic novelty reward r_t^episodic is calculated through the segment memory pool as follows:
at the current moment, the state information extracted by the state characterization module from the agent's observation information is denoted f(x_t);
the n memories whose state representations are closest to f(x_t) in Euclidean distance are selected from the segment memory pool by a k-nearest-neighbor algorithm and denoted {f_1, ..., f_n};
a novelty reward inherent in the single search process is then calculated, where K is a kernel function that evaluates the distance between two state representations.
5. The aircraft cluster collaborative search method for zero-interaction communication according to claim 4, wherein:
ε represents a constant, d is the Euclidean distance metric function, and d̄ represents the average of the distances to the n nearest neighbors.
6. The aircraft cluster collaborative search method for zero-interaction communication according to claim 3, wherein: in the step (3.3), the global novelty intrinsic reward multiplier α_t is calculated through the random distillation network, where err(x_t) denotes the prediction error for the observation information x_t, and μ_e and σ_e are the running mean and running standard deviation of err(x_t).
7. The aircraft cluster collaborative search method for zero-interaction communication according to claim 6, wherein: the intrinsic reward of the h-th agent at the current moment is r_t^h = r_t^episodic · min{max{α_t, 1}, L}, where L is a predetermined upper bound on the reward multiplier α_t.
8. The aircraft cluster collaborative search method for zero-interaction communication according to claim 1, wherein: for each agent, the value functions of the intrinsic reward Q network and the extrinsic reward Q network are calculated as follows, with both networks learning by means of a universal value function approximator:
where Q^e and Q^i denote the extrinsic reward Q network and the intrinsic reward Q network respectively, β_j is the weight of the intrinsic reward Q value, β_j ∈ R+, j ∈ {0, 1, ..., N−1}; N is the number of agents, γ_j is the discount factor in the Q network learning process, θ is the network parameter of the value function, θ^e is the network parameter of the extrinsic reward Q network, and θ^i is the network parameter of the intrinsic reward Q network.
9. The aircraft cluster collaborative search method for zero-interaction communication according to claim 1, wherein: the update strategy of the hybrid Q network is as follows, where Q_h denotes the individual value function of the h-th agent and Q_tot denotes the system value function of the hybrid network.
CN202111532038.XA 2021-12-14 2021-12-14 Aircraft cluster collaborative search method with zero interaction communication Active CN114281103B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111532038.XA CN114281103B (en) 2021-12-14 2021-12-14 Aircraft cluster collaborative search method with zero interaction communication

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111532038.XA CN114281103B (en) 2021-12-14 2021-12-14 Aircraft cluster collaborative search method with zero interaction communication

Publications (2)

Publication Number Publication Date
CN114281103A CN114281103A (en) 2022-04-05
CN114281103B true CN114281103B (en) 2023-09-29

Family

ID=80872321

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111532038.XA Active CN114281103B (en) 2021-12-14 2021-12-14 Aircraft cluster collaborative search method with zero interaction communication

Country Status (1)

Country Link
CN (1) CN114281103B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114690623B (en) * 2022-04-21 2022-10-25 中国人民解放军军事科学院战略评估咨询中心 Intelligent agent efficient global exploration method and system for rapid convergence of value function
CN116974729B (en) * 2023-09-22 2024-02-09 浪潮(北京)电子信息产业有限公司 Task scheduling method and device for big data job, electronic equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113495578A (en) * 2021-09-07 2021-10-12 南京航空航天大学 Digital twin training-based cluster track planning reinforcement learning method
CN113592162A (en) * 2021-07-22 2021-11-02 西北工业大学 Multi-agent reinforcement learning-based multi-underwater unmanned aircraft collaborative search method

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3948671A1 (en) * 2019-05-23 2022-02-09 DeepMind Technologies Limited Jointly learning exploratory and non-exploratory action selection policies
US11703853B2 (en) * 2019-12-03 2023-07-18 University-Industry Cooperation Group Of Kyung Hee University Multiple unmanned aerial vehicles navigation optimization method and multiple unmanned aerial vehicles system using the same

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113592162A (en) * 2021-07-22 2021-11-02 西北工业大学 Multi-agent reinforcement learning-based multi-underwater unmanned aircraft collaborative search method
CN113495578A (en) * 2021-09-07 2021-10-12 南京航空航天大学 Digital twin training-based cluster track planning reinforcement learning method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
基于深度强化学习的弹道导弹中段突防控制 (Midcourse penetration control of ballistic missiles based on deep reinforcement learning); 南英; 蒋亮; 指挥信息系统与技术 (Command Information System and Technology), No. 04; full text *

Also Published As

Publication number Publication date
CN114281103A (en) 2022-04-05

Similar Documents

Publication Publication Date Title
Jiang et al. Deep-learning-based joint resource scheduling algorithms for hybrid MEC networks
CN114281103B (en) Aircraft cluster collaborative search method with zero interaction communication
CN114415735B (en) Dynamic environment-oriented multi-unmanned aerial vehicle distributed intelligent task allocation method
CN114690623B (en) Intelligent agent efficient global exploration method and system for rapid convergence of value function
CN111814066B (en) Dynamic social user alignment method and system based on heuristic algorithm
CN111832911A (en) Underwater combat effectiveness evaluation method based on neural network algorithm
Fan et al. Cb-dsl: Communication-efficient and byzantine-robust distributed swarm learning on non-iid data
CN114219066A (en) Unsupervised reinforcement learning method and unsupervised reinforcement learning device based on Watherstein distance
CN110851911A (en) Terminal state calculation model training method, control sequence searching method and device
CN115630566B (en) Data assimilation method and system based on deep learning and dynamic constraint
Liu et al. Efficient adversarial attacks on online multi-agent reinforcement learning
Sun et al. Learning controlled and targeted communication with the centralized critic for the multi-agent system
CN115730743A (en) Battlefield combat trend prediction method based on deep neural network
CN113300884B (en) GWO-SVR-based step-by-step network flow prediction method
CN114757092A (en) System and method for training multi-agent cooperative communication strategy based on teammate perception
WO2022076061A1 (en) Interactive agent
Samanta et al. Energy management in hybrid electric vehicles using optimized radial basis function neural network
CN116070714B (en) Cloud edge cooperative training method and system based on federal learning and neural architecture search
CN112991384B (en) DDPG-based intelligent cognitive management method for emission resources
US20220114474A1 (en) Interactive agent
Sopov Self-configuring Multi-strategy Genetic Algorithm for Non-stationary Environments
Zhang et al. Multi-agent feature learning and integration for mixed cooperative and competitive environment
CN117590757B (en) Multi-unmanned aerial vehicle cooperative task allocation method based on Gaussian distribution sea-gull optimization algorithm
Zhao et al. Deep Reinforcement Learning‐Based Air Defense Decision‐Making Using Potential Games
Hoang et al. SubIQ: Inverse Soft-Q Learning for Offline Imitation with Suboptimal Demonstrations

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant