CN117707219A - Unmanned aerial vehicle cluster investigation countermeasure method and device based on deep reinforcement learning - Google Patents

Unmanned aerial vehicle cluster investigation countermeasure method and device based on deep reinforcement learning

Info

Publication number
CN117707219A
CN117707219A
Authority
CN
China
Prior art keywords
unmanned aerial
aerial vehicle
countermeasure
representing
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410162415.2A
Other languages
Chinese (zh)
Inventor
赵亚楠
刘晓雨
肖奔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Lingkong Electronic Technology Co Ltd
Original Assignee
Xian Lingkong Electronic Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Lingkong Electronic Technology Co Ltd filed Critical Xian Lingkong Electronic Technology Co Ltd
Priority to CN202410162415.2A priority Critical patent/CN117707219A/en
Publication of CN117707219A publication Critical patent/CN117707219A/en
Pending legal-status Critical Current


Classifications

    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 - Road transport of goods or passengers
    • Y02T 10/10 - Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 - Engine management systems

Abstract

The application discloses an unmanned aerial vehicle cluster investigation countermeasure method and device based on deep reinforcement learning. The method comprises the following steps: constructing a countermeasure space; constructing a countermeasure network model based on an improved deep reinforcement learning algorithm and setting an experience buffer; setting a self state and execution actions for each unmanned aerial vehicle, and setting a global reward function for the unmanned aerial vehicle cluster; acquiring the current observation value of the unmanned aerial vehicle cluster and combining the execution actions of all unmanned aerial vehicles into a joint action of the cluster; executing the joint action in the three-dimensional countermeasure environment to obtain a global reward value and the observation value at the next moment, and storing the countermeasure experience in the experience buffer; and randomly sampling a set of countermeasure experiences as samples to train the countermeasure network model, and evaluating the performance of the trained countermeasure network model. This solves the problem in the prior art that the DQN is limited in the unmanned aerial vehicle cluster investigation countermeasure environment, thereby realizing an unmanned aerial vehicle cluster investigation countermeasure method with a better simulated countermeasure effect and practical reference value.

Description

Unmanned aerial vehicle cluster investigation countermeasure method and device based on deep reinforcement learning
Technical Field
The application relates to the technical field of unmanned aerial vehicle cluster control, in particular to an unmanned aerial vehicle cluster investigation countermeasure method and device based on deep reinforcement learning.
Background
Unmanned aerial vehicles are highly autonomous agents that can sense the environment and make decisions to accomplish tasks or cope with changes, and they have significant advantages over manned aircraft in some special tasks. However, a single unmanned aerial vehicle has limited detection capability and can carry only limited electronic equipment, so it is difficult for it to cope effectively with high-speed, complex and changeable tasks. An unmanned aerial vehicle cluster can make up for the disadvantages of a single unmanned aerial vehicle. An unmanned aerial vehicle cluster is a multi-agent system composed of a certain number of unmanned aerial vehicles that coordinate with each other to jointly complete specific tasks, and it is widely applied in fields such as aeronautics, express logistics, precision agriculture and urban traffic.
The multi-agent deep reinforcement learning method provides a new idea for autonomous detection by unmanned aerial vehicle clusters. DQN (Deep Q-Network, a Q-learning algorithm based on deep learning) is one of the typical algorithms in the field of deep reinforcement learning. Its principle can be summarized as follows: reinforcement learning is applied to explore feasible strategies in the decision space, the gains obtained under different states and actions are recorded as experiences and placed into an experience pool, and the sample data in the experience pool is used to train the deep Q network so that the Q network can estimate the state-action value. The balance between "exploration" and "exploitation" is a key factor determining the training quality of a DQN model. In an unmanned aerial vehicle cluster reconnaissance countermeasure environment, the decision action space is too large, and reconnaissance, countermeasure behaviors and the like exhibit a certain range of randomness during training, so a strategy based only on the deep Q network has limitations.
Disclosure of Invention
The unmanned aerial vehicle cluster investigation countermeasure method based on deep reinforcement learning provided by the embodiments of the application solves the problem in the prior art that the DQN is of limited use in an unmanned aerial vehicle cluster investigation countermeasure environment.
In a first aspect, an embodiment of the present application provides an unmanned aerial vehicle cluster investigation countermeasure method based on deep reinforcement learning, including: constructing a countermeasure space comprising a three-dimensional countermeasure environment and two countermeasure parties, wherein each countermeasure party comprises an unmanned aerial vehicle cluster formed by the same number of detectors and attackers; constructing a countermeasure network model based on an improved deep reinforcement learning algorithm, and setting an experience buffer; setting a current self state and execution actions for each unmanned aerial vehicle in the unmanned aerial vehicle cluster, and setting a global reward function for the unmanned aerial vehicle cluster; acquiring a current observation value of the three-dimensional countermeasure environment observed by the unmanned aerial vehicle cluster, and combining the execution actions of all unmanned aerial vehicles in the cluster into a joint action of the cluster; executing the joint action in the three-dimensional countermeasure environment so as to obtain a global reward value of the unmanned aerial vehicle cluster and the observation value at the next moment, and storing the countermeasure experience in the experience buffer, wherein the countermeasure experience includes the current observation value, the joint action, the global reward value and the observation value at the next moment; and randomly sampling a set of the countermeasure experiences in the experience buffer as samples to train the countermeasure network model, and evaluating the performance of the trained countermeasure network model.
With reference to the first aspect, in one possible implementation manner, the improved deep reinforcement learning algorithm includes: reconstructing the weights and biases of nodes in the neural network into noise weights and noise biases with noise so as to improve the deep reinforcement learning algorithm; wherein the noise weights and the noise biases are obtained based on the training parameters of the countermeasure network model and random noise.
With reference to the first aspect, in one possible implementation manner, the self state is as follows:
$s_i^{D}=\left(x_i, y_i, z_i, v_{x,i}, v_{y,i}, v_{z,i}, \phi_i, \theta_i, \psi_i, L_i, d_i, g_i\right)$; wherein $s_i^{D}$ represents said self state of the i-th detector in the unmanned aerial vehicle cluster of one countermeasure party, $x_i$, $y_i$ and $z_i$ respectively represent the position of the i-th detector on the x-axis, y-axis and z-axis, $v_{x,i}$, $v_{y,i}$ and $v_{z,i}$ respectively represent the speed of the i-th detector in the x-axis, y-axis and z-axis directions, $\phi_i$ represents the roll angle of the i-th detector, $\theta_i$ represents the pitch angle of the i-th detector, $\psi_i$ represents the yaw angle of the i-th detector, $L_i$ represents the list information of the opposing unmanned aerial vehicle cluster detected by the i-th detector, and $d_i$ and $g_i$ respectively represent the survival state and radar switch information of the detectors in the opposing unmanned aerial vehicle cluster detected by the i-th detector;
$s_i^{A}=\left(x_i, y_i, z_i, v_{x,i}, v_{y,i}, v_{z,i}, \phi_i, \theta_i, \psi_i, L_i, d_i, g_i, c_i\right)$; wherein $s_i^{A}$ represents said self state of the i-th attacker in the unmanned aerial vehicle cluster of one countermeasure party, $x_i$, $y_i$ and $z_i$ respectively represent the position of the i-th attacker on the x-axis, y-axis and z-axis, $v_{x,i}$, $v_{y,i}$ and $v_{z,i}$ respectively represent the speed of the i-th attacker in the x-axis, y-axis and z-axis directions, $\phi_i$ represents the roll angle of the i-th attacker, $\theta_i$ represents the pitch angle of the i-th attacker, $\psi_i$ represents the yaw angle of the i-th attacker, $L_i$ represents the list information of the opposing unmanned aerial vehicle cluster detected by the i-th attacker, $d_i$ and $g_i$ respectively represent the survival state and radar switch information of the attackers in the opposing unmanned aerial vehicle cluster detected by the i-th attacker, and $c_i$ represents the number of attacks performed by the i-th attacker.
With reference to the first aspect, in one possible implementation manner, the setting of a global reward function for the unmanned aerial vehicle cluster includes: setting reward rules for the unmanned aerial vehicle cluster countermeasure process, wherein the reward rules comprise a detector reward rule, an attacker reward rule and a win/lose reward rule; and determining the global reward function based on the reward rules.
With reference to the first aspect, in one possible implementation manner, before the obtaining of the current observation value of the three-dimensional countermeasure environment observed by the unmanned aerial vehicle cluster, the method further includes: setting countermeasure parameters for the countermeasure process of the unmanned aerial vehicle cluster; wherein the countermeasure parameters include the number of unmanned aerial vehicles of the two countermeasure parties participating in the countermeasure, the number of countermeasure rounds, and the maximum interaction length of each round.
With reference to the first aspect, in a possible implementation manner, the randomly sampling a set of the countermeasure experiences in the experience buffer as samples to train the countermeasure network model includes: iteratively executing an updating step until the number of iterations of the countermeasure network model reaches a preset number of training times; the updating step includes: randomly sampling a set of the countermeasure experiences from the experience buffer, and determining a target Q value using the countermeasure network model based on the sampled countermeasure experiences; inputting the current self state of each unmanned aerial vehicle in the unmanned aerial vehicle cluster into the countermeasure network model to obtain the current Q value corresponding to the execution action; and determining a loss value according to the target Q value and the current Q value, and updating the parameters of the countermeasure network model according to the loss value.
With reference to the first aspect, in one possible implementation manner, the formula for determining the target Q value is as follows:
$y = r_i + \gamma \max_{a \in A} Q\left(s_{i+1}, a, \varepsilon; \mu, \sigma\right)$; wherein $y$ represents the target Q value, $s_i$ represents said self state of the unmanned aerial vehicle at moment $i$, $a_i$ represents said execution action of the unmanned aerial vehicle at moment $i$, $\varepsilon$ represents random noise, $\mu$ and $\sigma$ represent the parameters of the main network in the countermeasure network model, $r_i$ represents the instant reward obtained after performing said execution action at moment $i$, $\gamma$ represents the discount factor, and $A$ represents the action set of said execution actions of the unmanned aerial vehicle.
With reference to the first aspect, in one possible implementation manner, the formula for determining the loss value is as follows:
$L\left(\mu,\sigma\right)=\mathbb{E}\left[\left(y-Q\left(s_i, a_i, \varepsilon; \mu, \sigma\right)\right)^{2}\right]$, where $y = r_i + \gamma \max_{a \in A} Q\left(s_{i+1}, a, \varepsilon'; \mu', \sigma'\right)$; wherein $L$ represents the loss value, $y$ represents the target Q value, $Q\left(s_i, a_i, \varepsilon; \mu, \sigma\right)$ represents the current Q value, $s_i$ represents said self state of the unmanned aerial vehicle at moment $i$, $a_i$ represents said execution action of the unmanned aerial vehicle at moment $i$, $\varepsilon$ and $\varepsilon'$ represent random noise, $\mu$ and $\sigma$ represent the parameters of the main network in the countermeasure network model, $r_i$ represents the instant reward obtained after performing said execution action at moment $i$, $\gamma$ represents the discount factor, $A$ represents the action set of said execution actions of the unmanned aerial vehicle, and $\mu'$ and $\sigma'$ represent the parameters of the target network in the countermeasure network model.
With reference to the first aspect, in one possible implementation manner, the formula for updating the parameters of the countermeasure network model according to the loss value is as follows:
$\theta \leftarrow \theta + \alpha\,\delta\,\nabla_{\theta} Q\left(s_i, a_i, \varepsilon; \mu, \sigma\right),\ \theta \in \left\{\mu_w, \sigma_w, \mu_b, \sigma_b\right\}$, where $\delta = y - Q\left(s_i, a_i, \varepsilon; \mu, \sigma\right)$; wherein $\left(\mu_w, \sigma_w, \mu_b, \sigma_b\right)$ represents the parameters of the countermeasure network model that need to be trained, $\alpha$ represents the learning rate of the optimizer, $\delta$ represents the difference between the target Q value and the current Q value, $\nabla_{\mu_w} Q$, $\nabla_{\sigma_w} Q$, $\nabla_{\mu_b} Q$ and $\nabla_{\sigma_b} Q$ represent the gradients of the value function with respect to the parameters $\mu_w$, $\sigma_w$, $\mu_b$ and $\sigma_b$ respectively, $s_i$ represents said self state of the unmanned aerial vehicle at moment $i$, $a_i$ represents said execution action of the unmanned aerial vehicle at moment $i$, $\varepsilon$ represents random noise, and $\mu$ and $\sigma$ represent the parameters of the main network in the countermeasure network model.
In a second aspect, an embodiment of the present application provides an unmanned aerial vehicle cluster investigation countermeasure device based on deep reinforcement learning, including: an environment construction module, used for constructing a countermeasure space comprising a three-dimensional countermeasure environment and two countermeasure parties, wherein each countermeasure party comprises an unmanned aerial vehicle cluster formed by the same number of detectors and attackers; a model building module, used for constructing a countermeasure network model based on an improved deep reinforcement learning algorithm and setting an experience buffer; a setting module, used for setting the current self state and execution actions of the unmanned aerial vehicles in the unmanned aerial vehicle cluster and setting a reward function for the unmanned aerial vehicle cluster; a determining module, used for obtaining a current observation value of the three-dimensional countermeasure environment observed by the unmanned aerial vehicle cluster and combining the execution actions of all unmanned aerial vehicles in the cluster into a joint action of the cluster; a countermeasure module, used for executing the joint action in the three-dimensional countermeasure environment so as to obtain a reward value of the unmanned aerial vehicle cluster and the observation value at the next moment, and storing the countermeasure experience in the experience buffer, wherein the countermeasure experience includes the current observation value, the joint action, the reward value and the observation value at the next moment; and a training evaluation module, used for randomly sampling a set of countermeasure experiences in the experience buffer as samples to train the countermeasure network model and evaluating the performance of the trained countermeasure network model.
One or more technical solutions provided in the embodiments of the present application at least have the following technical effects or advantages:
according to the embodiment of the application, the robustness of the antagonism network model to environmental changes can be enhanced by improving the deep reinforcement learning algorithm; by randomly sampling a group of counterexperiences from the experience buffer, the correlation between samples can be broken, and the training effect can be improved; the influence of the joint action of the unmanned aerial vehicle cluster on the countermeasure result can be evaluated by setting the global rewarding function. The method effectively solves the problems that in the prior art, the DQN is too large in decision action space in an unmanned aerial vehicle cluster reconnaissance countermeasure environment, and in addition, reconnaissance, countermeasure behaviors and the like in the training process have randomness in a certain range, further realizes an unmanned aerial vehicle cluster reconnaissance countermeasure method based on deep reinforcement learning, builds a countermeasure network model based on the improved DQN, has better simulation countermeasure effect and has practical reference value.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings used in the embodiments of the present application or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of a method for detecting and countering a cluster of unmanned aerial vehicles based on deep reinforcement learning according to an embodiment of the present application;
FIG. 2 is a flowchart of training an countermeasure network model by randomly sampling a set of countermeasure experiences in an experience buffer according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of an unmanned aerial vehicle cluster investigation countermeasure device based on deep reinforcement learning according to an embodiment of the present application;
FIG. 4 is a schematic view of a heading angle provided in an embodiment of the present application;
FIG. 5 is a schematic diagram of an experience buffer according to an embodiment of the present application;
fig. 6 is an exemplary diagram of a challenge network model provided in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application. It will be apparent that the described embodiments are some, but not all, embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Some of the techniques involved in the embodiments of the present application are described below to aid understanding, and they should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Also, for the sake of clarity and conciseness, descriptions of well-known functions and constructions are omitted in the following description.
Fig. 1 is a flowchart of a method for detecting and countering a cluster of unmanned aerial vehicles based on deep reinforcement learning according to an embodiment of the present application, including steps 101 to 106. Fig. 1 is only one execution order shown in the embodiment of the present application, and does not represent the only execution order of the method for performing the cluster investigation countermeasure of the unmanned aerial vehicle based on deep reinforcement learning, and the steps shown in fig. 1 may be executed in parallel or in reverse in case that the final result can be achieved.
Step 101: constructing a countermeasure space comprising a three-dimensional countermeasure environment and both countermeasure. Wherein, the countering both sides comprise unmanned aerial vehicle clusters of both sides composed of the same number of detecting machines and attack machines. Both the attack machine and the detector have the function of omni-directional exploration. In the process of constructing the countermeasure space, the influence of factors such as the shape, the mass distribution, the damage probability of different positions of the fuselage and the like of each unmanned aerial vehicle on the countermeasure situation is not considered, the unmanned aerial vehicle is simplified into one particle, and the movement of the unmanned aerial vehicle is approximate to particle movement. Illustratively, the countermeasure double party may be set as a red party and a blue party, respectively.
In the embodiments of the present application, an appropriate coordinate system is established in a three-dimensional countermeasure environment. Illustratively, the direction of the positive east is taken as the direction of the X axis, the direction of the positive north is taken as the direction of the Y axis, and the Z axis points to the sky direction and is perpendicular to the X axis and the Y axis.
And setting a basic data related module of the unmanned aerial vehicle cluster in the countermeasure scene. In the embodiment of the application, each unmanned aerial vehicle in the unmanned aerial vehicle cluster is an agent denoted by a, and the unmanned aerial vehicle clusters of the opposing parties include n unmanned aerial vehicles in total, and are collectedAnd (3) representing.
And establishing an unmanned plane intelligent body model, wherein for each detector in the two countermeasures, the action selected by the unmanned plane intelligent body model at the time step t consists of a moving module and a detecting module. For each attack machine in the cluster, the action selected at the time step t is composed of three modules of movement, detection and attack. For the movement module, the drone selects one direction in three-dimensional space and moves a unit step in unit time along this direction. And when the radar state is on, the detection module executes a detection task, carries out omnidirectional detection according to a certain detection radius, and records the detected unmanned aerial vehicle of the opposite side. And for the attack module, if one unmanned aerial vehicle detects the same unmanned aerial vehicle of the other unmanned aerial vehicle in a plurality of continuous time steps, carrying out attack.
Step 102: and constructing an antagonism network model based on an improved deep reinforcement learning algorithm, and setting an experience buffer. In an embodiment of the present application, an improved deep reinforcement learning algorithm includes:
Reconstructing weights and biases of nodes in the neural network into noise weights and noise biases with noise so as to improve a deep reinforcement learning algorithm; wherein the noise weights and noise offsets are obtained based on training parameters of the countermeasure network model and random noise. The method comprises the following steps:
,/>. In (1) the->And->Representing noise weight and noise bias, (-) ->,/>,/>,/>) Training parameters representing an countermeasure network model, +.>And->Representing random noise +.>Representing the hadamard product operation.
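By way of illustration only, the following PyTorch sketch shows one way such a noisy layer could be realized; the class name NoisyLinear, the initialization scheme and the parameter names mu_w, sigma_w, mu_b and sigma_b are assumptions introduced here and are not part of the patent.

```python
# Illustrative sketch (not the patent's implementation): a linear layer whose
# weight and bias are reconstructed as w = mu_w + sigma_w * eps_w and
# b = mu_b + sigma_b * eps_b, matching the formulas above.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyLinear(nn.Module):
    def __init__(self, in_features: int, out_features: int, sigma_init: float = 0.017):
        super().__init__()
        # Trainable parameters (mu_w, sigma_w, mu_b, sigma_b)
        self.mu_w = nn.Parameter(torch.empty(out_features, in_features).uniform_(-0.1, 0.1))
        self.sigma_w = nn.Parameter(torch.full((out_features, in_features), sigma_init))
        self.mu_b = nn.Parameter(torch.empty(out_features).uniform_(-0.1, 0.1))
        self.sigma_b = nn.Parameter(torch.full((out_features,), sigma_init))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Fresh random noise eps_w, eps_b on every forward pass; the Hadamard
        # product sigma * eps perturbs the mean parameters.
        eps_w = torch.randn_like(self.sigma_w)
        eps_b = torch.randn_like(self.sigma_b)
        w = self.mu_w + self.sigma_w * eps_w
        b = self.mu_b + self.sigma_b * eps_b
        return F.linear(x, w, b)
```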
A countermeasure network model is constructed based on the improved deep reinforcement learning algorithm, as shown in fig. 6.
An experience buffer is set and its capacity is set. The experience buffer is shown in fig. 5.
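A minimal sketch of such a buffer is given below; the class name, the default capacity and the tuple layout (current observation, joint action, global reward, next observation) are assumptions for illustration.

```python
# Illustrative fixed-capacity experience buffer with random sampling;
# not the patent's implementation.
import random
from collections import deque

class ExperienceBuffer:
    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)  # oldest experiences are discarded first

    def store(self, obs, joint_action, global_reward, next_obs):
        self.buffer.append((obs, joint_action, global_reward, next_obs))

    def sample(self, batch_size: int):
        # Random sampling breaks the temporal correlation between samples
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```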
Step 103: and setting the current self state and execution action for the unmanned aerial vehicle in the unmanned aerial vehicle cluster, and setting a reward function of the unmanned aerial vehicle cluster. In this embodiment of the present application, the self state of the unmanned aerial vehicle is divided into the self state of the probe machine and the self state of the attack machine, specifically as follows:
. In (1) the->Indicating the self status of the ith detector in the fighter unmanned aerial vehicle cluster,/->、/>And->Respectively representing the positions of the ith detector in the x-axis, the y-axis and the z-axis,/->、/>And->Respectively representing the speed of the ith detector in the directions of x axis, y axis and z axis,/for the detector >Indicating the roll angle of the ith detector, < +.>Represents the pitch angle of the ith detector, +.>Indicating the yaw angle of the ith detector, < >>List information indicating unmanned aerial vehicle clusters of the opponent detected by the ith detector,/->And g respectively represents the survival state and radar switch information of the detector in the unmanned aerial vehicle cluster of the opposite side detected by the ith detector.
. In (1) the->Indicating the status of the i-th attack machine in the fighter unmanned plane cluster,/->、/>And->Respectively representing the positions of the ith attack machine in the x axis, the y axis and the z axis,/->、/>And->Respectively representing the speed of the ith attack machine in the directions of the x axis, the y axis and the z axis,/for the first attack machine>Indicating the roll angle of the ith attack machine, +.>Representing pitch angle of i-th attack machine, +.>Indicating the yaw angle of the ith attack machine, < +.>List information indicating the opponent unmanned aerial vehicle cluster detected by the ith attack machine, +.>And g respectively represents the survival state of the attack machine in the opposite unmanned aerial vehicle cluster detected by the ith attack machine and radar switch information. c represents the attack times of the ith attack machine.
As shown in fig. 4, the roll angle refers to the angle between the Z axis of the unmanned aerial vehicle body coordinate system and the vertical plane containing the X axis of the body coordinate system, the pitch angle refers to the angle between the velocity of the unmanned aerial vehicle and its projection on the horizontal plane, and the yaw angle refers to the angle between the projection of the velocity of the unmanned aerial vehicle on the horizontal plane and due north.
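For illustration only, the self states described above could be held in simple data structures such as the following sketch; the class and field names are assumptions chosen to mirror the text, not the patent's data layout.

```python
# Illustrative containers for the detector and attacker self states.
from dataclasses import dataclass, field
from typing import List

@dataclass
class DetectorState:
    x: float        # position on the x-axis
    y: float        # position on the y-axis
    z: float        # position on the z-axis
    vx: float       # speed in the x-axis direction
    vy: float       # speed in the y-axis direction
    vz: float       # speed in the z-axis direction
    roll: float     # roll angle
    pitch: float    # pitch angle
    yaw: float      # yaw angle
    detected: List[int] = field(default_factory=list)  # list of detected opposing UAVs
    alive: bool = True       # survival state
    radar_on: bool = True    # radar switch information

@dataclass
class AttackerState(DetectorState):
    attacks: int = 0         # number of attacks performed (c)
```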
In this embodiment of the present application, according to the forces acting on the unmanned aerial vehicle, the execution actions of the detector can be divided into the following seven types: normal flight, accelerated flight, decelerated flight, left turn, right turn, pull-up and dive. The execution actions of the attacker can be divided into the following eight types: normal flight, accelerated flight, decelerated flight, left turn, right turn, pull-up, dive and attack.
In one embodiment of the present application, a quadruple $\langle n_x, n_z, \gamma_c, Ack\rangle$ may also be used to encode the execution actions of the detector and the attacker. Here $n_x$ represents the overload of the unmanned aerial vehicle in the speed direction: $n_x>0$ indicates accelerated flight, $n_x<0$ indicates decelerated flight, and $n_x=0$ indicates constant-speed flight; $n_z$ represents the overload of the unmanned aerial vehicle in the Z-axis direction of the coordinate system, which provides the lift of the unmanned aerial vehicle and must be kept at a certain positive value to ensure stable flight; $\gamma_c$ represents the roll angle of the unmanned aerial vehicle, i.e. the rotation angle around its central axis, which changes its forward direction; and $Ack$ represents the attack action of the unmanned aerial vehicle.
Illustratively, the specific execution actions of the unmanned aerial vehicle are encoded as shown in Table 1 below.
Table 1: Encoding of the execution actions of the unmanned aerial vehicle
In another embodiment of the present application, the detector cannot attack, so $Ack$ in the detector's quadruple is always 0; those skilled in the art may also omit $Ack$ and encode the execution actions of the detector with the triplet $\langle n_x, n_z, \gamma_c\rangle$.
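Purely as an illustration of such an encoding (the concrete values belong to Table 1 and are not reproduced here), a hypothetical mapping might look as follows; every numeric value below is an assumption.

```python
# Hypothetical action table: quadruples <n_x, n_z, gamma_c, Ack>.
# All numeric values are illustrative assumptions, not the values of Table 1.
DETECTOR_ACTIONS = {
    "normal_flight": (0.0, 1.0, 0.0, 0),
    "accelerate":    (1.0, 1.0, 0.0, 0),
    "decelerate":    (-1.0, 1.0, 0.0, 0),
    "turn_left":     (0.0, 1.0, -1.0, 0),
    "turn_right":    (0.0, 1.0, 1.0, 0),
    "pull_up":       (0.0, 2.0, 0.0, 0),
    "dive":          (0.0, 0.5, 0.0, 0),
}
# The attacker shares the seven flight actions and adds an attack action (Ack = 1).
ATTACKER_ACTIONS = dict(DETECTOR_ACTIONS, attack=(0.0, 1.0, 0.0, 1))
```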
In addition, the motion of the unmanned aerial vehicle is approximated as particle motion, and a kinematic equation can be used for modeling the motion process of the unmanned aerial vehicle in a three-dimensional countermeasure environment. The kinematic equation of the unmanned aerial vehicle is as follows:
$dx = v_x\,dt$, $dy = v_y\,dt$, $dz = v_z\,dt$; $dv_x = a_x\,dt$, $dv_y = a_y\,dt$, $dv_z = a_z\,dt$; $d\phi = p\,dt$, $d\theta = q\,dt$, $d\psi = r\,dt$. In the formulas, $dx$, $dy$ and $dz$ respectively represent the position change of the unmanned aerial vehicle in the x-axis, y-axis and z-axis directions, $v_x$, $v_y$ and $v_z$ respectively represent the speed of the unmanned aerial vehicle in the x-axis, y-axis and z-axis directions, $dv_x$, $dv_y$ and $dv_z$ respectively represent the speed change of the unmanned aerial vehicle in the x-axis, y-axis and z-axis directions, $a_x$, $a_y$ and $a_z$ represent the influence of external forces along the x-axis, y-axis and z-axis respectively, $d\phi$ represents the change of the roll angle of the unmanned aerial vehicle, $d\theta$ represents the change of the pitch angle, $d\psi$ represents the change of the yaw angle, $dt$ represents the current time step, $p$ represents the change rate of the roll angle, $q$ represents the change rate of the pitch angle, and $r$ represents the change rate of the yaw angle.
After each time step $dt$, the position, speed and heading angles of the unmanned aerial vehicle are updated by a numerical integration method.
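A minimal sketch of such a numerical update (simple Euler integration over one step dt) is shown below; the function and the state object it mutates are assumptions, for example the DetectorState container sketched earlier.

```python
# Illustrative Euler integration of the kinematic equations over one time step dt;
# `state` is assumed to expose x, y, z, vx, vy, vz, roll, pitch, yaw attributes.
def integrate_step(state, ax, ay, az, p, q, r, dt):
    # Position change: dx = vx*dt, dy = vy*dt, dz = vz*dt
    state.x += state.vx * dt
    state.y += state.vy * dt
    state.z += state.vz * dt
    # Speed change driven by external forces: dvx = ax*dt, dvy = ay*dt, dvz = az*dt
    state.vx += ax * dt
    state.vy += ay * dt
    state.vz += az * dt
    # Attitude change: droll = p*dt, dpitch = q*dt, dyaw = r*dt
    state.roll += p * dt
    state.pitch += q * dt
    state.yaw += r * dt
    return state
```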
In the embodiment of the application, reward rules for the unmanned aerial vehicle cluster countermeasure process are set. The reward rules comprise a detector reward rule, an attacker reward rule and a win/lose reward rule.
The detector reward rule is expressed as a detector reward function whose output $R_d$ is the reward value of the detector.
The attacker reward rule is expressed as an attacker reward function whose output $R_a$ is the reward value of the attacker.
The win/lose reward rule is expressed as a win/lose reward function whose output $R_w$ is the win/lose reward value of the corresponding party. If all unmanned aerial vehicles of one of the two countermeasure parties are destroyed before the accumulated simulation steps reach 500, the other party, whose unmanned aerial vehicles have not all been destroyed, is regarded as the winner and obtains a win/lose reward value of 500, while the party whose unmanned aerial vehicles have all been destroyed is regarded as having lost, is penalized by 500, and its win/lose reward value is -500. If the maximum number of simulation steps is reached and unmanned aerial vehicles of both countermeasure parties survive, the party with more remaining unmanned aerial vehicles is regarded as the winner and its win/lose reward value is 300, while the other party is regarded as having lost, is penalized by 300, and its win/lose reward value is -300. When the numbers of remaining unmanned aerial vehicles of the two countermeasure parties are equal, the countermeasure is regarded as a tie and the win/lose reward value of both parties is 0.
The global reward function of the unmanned aerial vehicle cluster is determined based on the reward rules as follows:
$R = R_d + R_a + R_w$. In the formula, $R$ represents the global reward value, $R_d$ represents the reward value of the detector, $R_a$ represents the reward value of the attacker, and $R_w$ represents the win/lose reward value.
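A sketch of how the global reward could be assembled from these rules is given below; the function signature and the episode bookkeeping flags are assumptions, only the reward values follow the rules above.

```python
# Illustrative computation of the global reward R = R_d + R_a + R_w;
# not the patent's implementation.
def global_reward(detector_reward: float, attacker_reward: float,
                  own_all_destroyed: bool, enemy_all_destroyed: bool,
                  step: int, own_left: int, enemy_left: int,
                  max_steps: int = 500) -> float:
    if enemy_all_destroyed:            # all opposing UAVs destroyed: win
        win_lose = 500.0
    elif own_all_destroyed:            # all own UAVs destroyed: loss
        win_lose = -500.0
    elif step >= max_steps:            # episode ends on the step limit
        if own_left > enemy_left:
            win_lose = 300.0           # more UAVs remaining: win
        elif own_left < enemy_left:
            win_lose = -300.0          # fewer UAVs remaining: loss
        else:
            win_lose = 0.0             # equal numbers: tie
    else:
        win_lose = 0.0                 # episode still running
    return detector_reward + attacker_reward + win_lose
```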
Step 104: the method comprises the steps of obtaining a current observation value of the three-dimensional countermeasure environment observed by the unmanned aerial vehicle cluster, and determining that the execution actions of all unmanned aerial vehicles in the unmanned aerial vehicle cluster form a joint action of the unmanned aerial vehicle cluster.
In the embodiment of the application, before the current observation value of the three-dimensional countermeasure environment observed by the unmanned aerial vehicle cluster is obtained, countermeasure parameters in the countermeasure process are set for the unmanned aerial vehicle cluster. The countermeasure parameters include the number of unmanned aerial vehicles of the countermeasure parties participating in the countermeasure, the number of countermeasure rounds and the maximum interaction length of each round.
In addition, flight constraints are set for the unmanned aerial vehicle to ensure its flight safety in the three-dimensional countermeasure environment. Specifically, the flight area of the unmanned aerial vehicle is limited to the three-dimensional countermeasure environment, i.e. $x \in \left[0, \text{width}\right]$, $y \in \left[0, \text{height}\right]$ and $z \in \left[0, \text{depth}\right]$, where $x$, $y$ and $z$ respectively represent the position of the unmanned aerial vehicle on the x-axis, y-axis and z-axis, width represents the width of the three-dimensional countermeasure environment, height represents its height, and depth represents its depth.
Furthermore, the flying speed and acceleration of the unmanned aerial vehicle are kept within a safe range, and ranges are set for the heading angles, i.e. the roll angle, the pitch angle and the yaw angle are each restricted to a preset range.
In the embodiment of the application, the unmanned aerial vehicle clusters of the two countermeasure parties autonomously detect the three-dimensional countermeasure environment to obtain the observation value, and an ε-greedy strategy is adopted to select the execution action for each unmanned aerial vehicle in the cluster, specifically as follows:
$a_i^t=\begin{cases}\arg\max_{a\in A} Q(s,a), & \text{with probability } 1-\epsilon\\ \text{a random action from } A, & \text{with probability } \epsilon\end{cases}$. In the formula, $a_i^t$ represents the execution action selected by unmanned aerial vehicle $i$ at time $t$, $A$ represents the action set of the execution actions of the unmanned aerial vehicle, $Q(s,a)$ represents the expected reward estimate from previous experience, and $\epsilon$ represents the decay factor. At each moment the unmanned aerial vehicle selects, with probability $1-\epsilon$, the action with the maximum expected reward estimate in past experience, and with probability $\epsilon$ randomly selects an action from the action set, thereby achieving a balance between exploration and exploitation.
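A minimal sketch of this selection rule for a single unmanned aerial vehicle is shown below; the Q-value list and the epsilon schedule are assumptions.

```python
# Illustrative epsilon-greedy action selection; not the patent's implementation.
import random

def select_action(q_values, epsilon: float) -> int:
    """q_values: one expected-reward estimate per action in the action set A."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                  # explore: random action
    return max(range(len(q_values)), key=q_values.__getitem__)  # exploit: greedy action
```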
The execution actions of all unmanned aerial vehicles in the unmanned aerial vehicle clusters of the two countermeasure parties form a joint action.
Step 105: Executing the joint action in the three-dimensional countermeasure environment so as to obtain the global reward value of the unmanned aerial vehicle cluster and the observation value at the next moment, and storing the countermeasure experience in the experience buffer. In the embodiment of the application, the joint action is executed in the three-dimensional countermeasure environment, the global reward value of the unmanned aerial vehicle cluster is obtained through the global reward function, and the observation value at the next moment is obtained at the same time. The countermeasure experience consisting of the current observation value, the joint action, the global reward value and the observation value at the next moment is stored in the experience buffer.
Step 106: a set of challenge experiences are randomly sampled in an experience buffer as samples to train a challenge network model, and the performance of the trained challenge network model is evaluated. Specifically, the implementation steps for training the challenge network model by randomly sampling a set of challenge experiences as samples in the experience buffer are shown in fig. 2, and include steps 201 to 206, which are specifically as follows.
Step 201: a set of challenge experiences are randomly sampled from the experience buffer and a target Q value is determined using a challenge network model based on the sampled challenge experiences. In the embodiment of the application, the antagonism network model is consistent with the deep reinforcement learning algorithm before improvement, and has two important components, namely a main network and a target network. The master network is used to train the countermeasure network model and the target network is used to update the countermeasure network model. Specifically, a set of challenge experiences are randomly sampled from the experience buffer, and the sampled challenge experiences are input as samples into a challenge network model to determine a target Q value, as follows:
. Wherein Q represents a target Q value, +.>Indicating the self-state of the unmanned aerial vehicle at the moment i, < + >>Representing the execution action of the unmanned aerial vehicle at the moment i, < +.>Representing random noise +. >And->Parameters representing the main network in the countermeasure network model, +.>Indicating the immediate rewards obtained after executing the execution action at instant i,/>Representing a discount factor, a representing a set of actions of the drone to perform.
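The following PyTorch sketch illustrates this target computation for a sampled batch, assuming a Q-network whose forward pass injects the random noise internally (for example one built from the NoisyLinear layers sketched earlier); all names are assumptions.

```python
# Illustrative target-Q computation y = r + gamma * max_a Q(s', a); a sketch only.
import torch

@torch.no_grad()
def target_q(batch_reward: torch.Tensor, batch_next_state: torch.Tensor,
             q_network, gamma: float = 0.99) -> torch.Tensor:
    next_q = q_network(batch_next_state)            # Q values for every action in A
    return batch_reward + gamma * next_q.max(dim=1).values
```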
Step 202: and inputting the current self state of each unmanned aerial vehicle in the unmanned aerial vehicle cluster into the countermeasure network model to obtain the current Q value of the corresponding execution action. Specifically, the antagonism network model is used to input the current self state of each unmanned aerial vehicle in the three-dimensional antagonism environment, so as to obtain the current Q value of the corresponding execution action of the unmanned aerial vehicle.
Step 203: and determining a loss value according to the target Q value and the current Q value. In the embodiment of the present application, the formula for determining the loss value is as follows:
. Wherein (1)>. In (1) the->Indicating a loss value->Representing the target Q value,/->Representing the current Q value, +.>Indicating the self-state of the unmanned aerial vehicle at the moment i, < + >>Representing the execution action of the unmanned aerial vehicle at the moment i, < +.>Representing random noise +.>And->Parameters representing the main network in the countermeasure network model, +.>Indicating the immediate rewards obtained after executing the execution action at instant i,/>Representing discount factors, A representing the set of actions of the unmanned aerial vehicle performing actions, +.>And- >Representing parameters of the target network in the antagonism network model.
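A sketch of the corresponding mean-squared-error loss between the target Q value and the current Q value of the executed action is given below; the tensor shapes and helper names are assumptions.

```python
# Illustrative loss L = E[(y - Q(s_i, a_i))^2]; a sketch, not the patent's code.
import torch
import torch.nn.functional as F

def dqn_loss(q_network, batch_state: torch.Tensor, batch_action: torch.Tensor,
             y: torch.Tensor) -> torch.Tensor:
    q_all = q_network(batch_state)                                        # Q for all actions
    q_current = q_all.gather(1, batch_action.long().unsqueeze(1)).squeeze(1)
    return F.mse_loss(q_current, y)
```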
Step 204: parameters of the countermeasure network model are updated according to the loss value. In the embodiment of the present application, the formula for updating the parameters of the countermeasure network model according to the loss value is as follows:
$\theta \leftarrow \theta + \alpha\,\delta\,\nabla_{\theta} Q\left(s_i, a_i, \varepsilon; \mu, \sigma\right),\ \theta \in \left\{\mu_w, \sigma_w, \mu_b, \sigma_b\right\}$, where $\delta = y - Q\left(s_i, a_i, \varepsilon; \mu, \sigma\right)$. In the formulas, $\left(\mu_w, \sigma_w, \mu_b, \sigma_b\right)$ represents the parameters of the countermeasure network model that need to be trained, $\alpha$ represents the learning rate of the optimizer, $\delta$ represents the difference between the target Q value and the current Q value, $\nabla_{\mu_w} Q$, $\nabla_{\sigma_w} Q$, $\nabla_{\mu_b} Q$ and $\nabla_{\sigma_b} Q$ represent the gradients of the value function with respect to the parameters $\mu_w$, $\sigma_w$, $\mu_b$ and $\sigma_b$ respectively, $s_i$ represents the self state of the unmanned aerial vehicle at moment $i$, $a_i$ represents the execution action of the unmanned aerial vehicle at moment $i$, $\varepsilon$ represents the random noise, and $\mu$ and $\sigma$ represent the parameters of the main network in the countermeasure network model.
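In practice this update is usually realized by letting an optimizer follow the gradient of the loss; a minimal sketch under that assumption is shown below (the choice of Adam and the learning rate are illustrative, not taken from the patent).

```python
# Illustrative parameter update of the main network (mu_w, sigma_w, mu_b, sigma_b)
# via automatic differentiation of the loss; a sketch only.
import torch

def update_main_network(q_network, optimizer: torch.optim.Optimizer, loss: torch.Tensor):
    optimizer.zero_grad()
    loss.backward()    # gradients with respect to mu_w, sigma_w, mu_b, sigma_b
    optimizer.step()   # move the parameters along the gradient scaled by the learning rate

# Example optimizer construction (learning rate is an assumption):
# optimizer = torch.optim.Adam(q_network.parameters(), lr=1e-4)
```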
Step 205: judging whether the iteration times reach the preset training times or not. Specifically, it is determined whether the number of iterations of the countermeasure network model at this time reaches a preset number of training times. If the iteration number of the countermeasure network model at this time does not reach the preset training number, continuing to iteratively execute the steps 201 to 205, and gradually approaching the target Q value to the current Q value through continuous iteration. Otherwise, step 206 is performed.
Step 206: and finishing iteration, and finishing training of the countermeasure network model. Specifically, when the iteration number of the countermeasure network model reaches the preset training number, the iteration is stopped, and the countermeasure network model completes training. After the challenge network model is trained, when the trained challenge network model is applied to perform unmanned plane cluster scout challenge tasks, no noise is required to be added, and the configuration is performedIs->Is 0.
The trained countermeasure network model is used for testing, and the reconnaissance countermeasure performance of the unmanned aerial vehicle clusters of the two parties is evaluated. The success rate, efficiency and countermeasure capability of the two countermeasure parties can be evaluated.
The success rate assessment is carried out by respectively tracking and recording the times of reconnaissance, attack and final win of the unmanned aerial vehicle clusters of the two countermeasures, and the performance of the unmanned aerial vehicle clusters on the reconnaissance task and the countermeasures task is assessed by calculating the success rate.
The efficiency evaluation is to judge whether the unmanned aerial vehicle cluster has advantages in terms of resource utilization by observing the time and resources spent by the unmanned aerial vehicle clusters of the two counterparties in completing tasks.
The countermeasure-capability evaluation assesses the performance of the own unmanned aerial vehicle cluster and the opposing unmanned aerial vehicle cluster when interacting under simulated countermeasure conditions, and judges the application potential and tactical effect of the unmanned aerial vehicle cluster in actual reconnaissance countermeasure.
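As a small illustration of the success-rate part of this evaluation, the outcome of each test episode could be tallied as follows; the outcome labels and helper name are assumptions.

```python
# Illustrative success-rate bookkeeping over test episodes; a sketch only.
from collections import Counter

def success_rates(episode_outcomes):
    """episode_outcomes: iterable of labels such as 'red_win', 'blue_win', 'tie'."""
    counts = Counter(episode_outcomes)
    total = sum(counts.values()) or 1
    return {outcome: n / total for outcome, n in counts.items()}
```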
Although the present application provides method operational steps as described in the examples or flowcharts, more or fewer operational steps may be included based on conventional or non-inventive labor. The order of steps recited in the present embodiment is only one way of performing the steps in a plurality of steps, and does not represent a unique order of execution. When implemented by an actual device or client product, the method of the present embodiment or the accompanying drawings may be performed sequentially or in parallel (e.g., in a parallel processor or a multithreaded environment).
As shown in fig. 3, the embodiment of the application further provides an unmanned aerial vehicle cluster investigation countermeasure device 300 based on deep reinforcement learning. The device comprises: the environment construction module 301, the model construction module 302, the setting module 303, the determining module 304, the countermeasure module 305 and the training evaluation module 306 are specifically as follows:
The environment construction module 301 is used for constructing a countermeasure space comprising a three-dimensional countermeasure environment and two countermeasure parties. Each countermeasure party comprises an unmanned aerial vehicle cluster formed by the same number of detectors and attackers. In the embodiment of the present application, the environment construction module 301 is specifically configured to establish an appropriate coordinate system in the three-dimensional countermeasure environment. Illustratively, due east is taken as the X-axis direction, due north as the Y-axis direction, and the Z axis points skyward, perpendicular to the X axis and the Y axis.
A basic data module is set for the unmanned aerial vehicle clusters in the countermeasure scene. In the embodiment of the application, each unmanned aerial vehicle in the unmanned aerial vehicle cluster is an agent denoted by $a$, the unmanned aerial vehicle clusters of the two countermeasure parties comprise $n$ unmanned aerial vehicles in total, and the set is denoted by $\{a_1, a_2, \ldots, a_n\}$.
An unmanned aerial vehicle agent model is established. For each detector of the two countermeasure parties, the action selected at time step t consists of a movement module and a detection module; for each attacker in the cluster, the action selected at time step t consists of three modules: movement, detection and attack. For the movement module, the unmanned aerial vehicle selects one direction in three-dimensional space and moves a unit step along this direction in unit time. For the detection module, when the radar state is on, the unmanned aerial vehicle performs a detection task, carries out omnidirectional detection within a certain detection radius, and records the detected opposing unmanned aerial vehicles. For the attack module, if one unmanned aerial vehicle detects the same opposing unmanned aerial vehicle in several consecutive time steps, it launches an attack.
The model building module 302 is configured to construct a countermeasure network model based on the improved deep reinforcement learning algorithm and to set an experience buffer. The model building module 302 is specifically configured to reconstruct the weights and biases of nodes in the neural network into noise weights and noise biases with noise so as to improve the deep reinforcement learning algorithm; the noise weights and noise biases are obtained based on the training parameters of the countermeasure network model and random noise, specifically as follows:
$w=\mu_w+\sigma_w\odot\varepsilon_w$, $b=\mu_b+\sigma_b\odot\varepsilon_b$. In the formulas, $w$ and $b$ represent the noise weight and noise bias, $\left(\mu_w, \sigma_w, \mu_b, \sigma_b\right)$ represents the parameters of the countermeasure network model that need to be trained, $\varepsilon_w$ and $\varepsilon_b$ represent random noise, and $\odot$ represents the Hadamard product operation.
A countermeasure network model is constructed based on the improved deep reinforcement learning algorithm.
An experience buffer is set and its capacity is set.
The setting module 303 is configured to set the current self state and execution actions of the unmanned aerial vehicles in the unmanned aerial vehicle cluster and to set a reward function for the unmanned aerial vehicle cluster. The setting module 303 is specifically configured to divide the self state of the unmanned aerial vehicle into the self state of the detector and the self state of the attacker, specifically as follows:
$s_i^{D}=\left(x_i, y_i, z_i, v_{x,i}, v_{y,i}, v_{z,i}, \phi_i, \theta_i, \psi_i, L_i, d_i, g_i\right)$. In the formula, $s_i^{D}$ represents the self state of the i-th detector in the unmanned aerial vehicle cluster of a countermeasure party, $x_i$, $y_i$ and $z_i$ respectively represent the position of the i-th detector on the x-axis, y-axis and z-axis, $v_{x,i}$, $v_{y,i}$ and $v_{z,i}$ respectively represent the speed of the i-th detector in the x-axis, y-axis and z-axis directions, $\phi_i$ represents the roll angle of the i-th detector, $\theta_i$ represents the pitch angle of the i-th detector, $\psi_i$ represents the yaw angle of the i-th detector, $L_i$ represents the list information of the opposing unmanned aerial vehicle cluster detected by the i-th detector, and $d_i$ and $g_i$ respectively represent the survival state and radar switch information of the detectors in the opposing unmanned aerial vehicle cluster detected by the i-th detector.
$s_i^{A}=\left(x_i, y_i, z_i, v_{x,i}, v_{y,i}, v_{z,i}, \phi_i, \theta_i, \psi_i, L_i, d_i, g_i, c_i\right)$. In the formula, $s_i^{A}$ represents the self state of the i-th attacker in the unmanned aerial vehicle cluster of a countermeasure party, $x_i$, $y_i$ and $z_i$ respectively represent the position of the i-th attacker on the x-axis, y-axis and z-axis, $v_{x,i}$, $v_{y,i}$ and $v_{z,i}$ respectively represent the speed of the i-th attacker in the x-axis, y-axis and z-axis directions, $\phi_i$ represents the roll angle of the i-th attacker, $\theta_i$ represents the pitch angle of the i-th attacker, $\psi_i$ represents the yaw angle of the i-th attacker, $L_i$ represents the list information of the opposing unmanned aerial vehicle cluster detected by the i-th attacker, $d_i$ and $g_i$ respectively represent the survival state and radar switch information of the attackers in the opposing unmanned aerial vehicle cluster detected by the i-th attacker, and $c_i$ represents the number of attacks performed by the i-th attacker.
In this embodiment of the present application, according to the forces acting on the unmanned aerial vehicle, the execution actions of the detector can be divided into the following seven types: normal flight, accelerated flight, decelerated flight, left turn, right turn, pull-up and dive. The execution actions of the attacker can be divided into the following eight types: normal flight, accelerated flight, decelerated flight, left turn, right turn, pull-up, dive and attack.
In one embodiment of the present application, a quadruple $\langle n_x, n_z, \gamma_c, Ack\rangle$ may also be used to encode the execution actions of the detector and the attacker. Here $n_x$ represents the overload of the unmanned aerial vehicle in the speed direction: $n_x>0$ indicates accelerated flight, $n_x<0$ indicates decelerated flight, and $n_x=0$ indicates constant-speed flight; $n_z$ represents the overload of the unmanned aerial vehicle in the Z-axis direction of the coordinate system, which provides the lift of the unmanned aerial vehicle and must be kept at a certain positive value to ensure stable flight; $\gamma_c$ represents the roll angle of the unmanned aerial vehicle, i.e. the rotation angle around its central axis, which changes its forward direction; and $Ack$ represents the attack action of the unmanned aerial vehicle.
In another embodiment of the present application, the detector cannot attack, so $Ack$ in the detector's quadruple is always 0; those skilled in the art may also omit $Ack$ and encode the execution actions of the detector with the triplet $\langle n_x, n_z, \gamma_c\rangle$.
In the embodiment of the application, reward rules for the unmanned aerial vehicle cluster countermeasure process are set. The reward rules comprise a detector reward rule, an attacker reward rule and a win/lose reward rule.
The detector reward rule is expressed as a detector reward function whose output $R_d$ is the reward value of the detector.
The attacker reward rule is expressed as an attacker reward function whose output $R_a$ is the reward value of the attacker.
The win/lose reward rule is expressed as a win/lose reward function whose output $R_w$ is the win/lose reward value of the corresponding party. If all unmanned aerial vehicles of one of the two countermeasure parties are destroyed before the accumulated simulation steps reach 500, the other party, whose unmanned aerial vehicles have not all been destroyed, is regarded as the winner and obtains a win/lose reward value of 500, while the party whose unmanned aerial vehicles have all been destroyed is regarded as having lost, is penalized by 500, and its win/lose reward value is -500. If the maximum number of simulation steps is reached and unmanned aerial vehicles of both countermeasure parties survive, the party with more remaining unmanned aerial vehicles is regarded as the winner and its win/lose reward value is 300, while the other party is regarded as having lost, is penalized by 300, and its win/lose reward value is -300. When the numbers of remaining unmanned aerial vehicles of the two countermeasure parties are equal, the countermeasure is regarded as a tie and the win/lose reward value of both parties is 0.
The global reward function of the unmanned aerial vehicle cluster is determined based on the reward rules as follows:
$R = R_d + R_a + R_w$. In the formula, $R$ represents the global reward value, $R_d$ represents the reward value of the detector, $R_a$ represents the reward value of the attacker, and $R_w$ represents the win/lose reward value.
The determining module 304 is configured to obtain a current observation value of the three-dimensional countermeasure environment observed by the unmanned aerial vehicle cluster, and determine that the execution actions of each unmanned aerial vehicle in the unmanned aerial vehicle cluster form a joint action of the unmanned aerial vehicle cluster. The determining module 304 is specifically configured to set a countermeasure parameter in a countermeasure process for the unmanned aerial vehicle cluster before obtaining a current observed value of the three-dimensional countermeasure environment observed by the unmanned aerial vehicle cluster. The countermeasure parameters include the number of unmanned aerial vehicles of the countermeasure parties participating in the countermeasure, the number of countermeasure rounds and the maximum interaction length of each round.
In addition, flight constraints are set for the unmanned aerial vehicle to ensure its flight safety in the three-dimensional countermeasure environment. Specifically, the flight area of the unmanned aerial vehicle is limited to the three-dimensional countermeasure environment, i.e. $x \in \left[0, \text{width}\right]$, $y \in \left[0, \text{height}\right]$ and $z \in \left[0, \text{depth}\right]$, where $x$, $y$ and $z$ respectively represent the position of the unmanned aerial vehicle on the x-axis, y-axis and z-axis, width represents the width of the three-dimensional countermeasure environment, height represents its height, and depth represents its depth.
Furthermore, the flying speed and acceleration of the unmanned aerial vehicle are kept within a safe range, and ranges are set for the heading angles, i.e. the roll angle, the pitch angle and the yaw angle are each restricted to a preset range.
In the embodiment of the application, the unmanned aerial vehicle clusters of the two countermeasures are loaded to autonomously detect the observation value of the three-dimensional countermeasures environment, and an epsilon_greedy strategy is adopted to select and execute actions for each unmanned aerial vehicle in the unmanned aerial vehicle clusters, specifically as follows:
. In (1) the->Representing the execution action selected by the unmanned aerial vehicle i at time t, A representing the action set of the execution action of the unmanned aerial vehicle, < ->Representing the expected prize estimate in previous experience, < + >>Representing attenuation factors, unmannedEvery time by +.>Probability selection of action in which the maximum prize estimate is expected in the past experience, with probability +.>The actions are randomly selected in the action set, so that the balance between exploration and utilization is realized.
The execution actions of all unmanned aerial vehicles in the unmanned aerial vehicle clusters of the two opposing parties form a joint action.
The countermeasure module 305 is configured to execute the joint action in the three-dimensional countermeasure environment, so as to obtain the reward value of the unmanned aerial vehicle cluster and the observation value at the next moment, and to store countermeasure experiences in the experience buffer. Each countermeasure experience includes the current observation value, the joint action, the reward value and the observation value at the next moment. The countermeasure module 305 is specifically configured to execute the joint action in the three-dimensional countermeasure environment, obtain the global reward value of the unmanned aerial vehicle cluster through the global reward function, and obtain the observation value at the next moment. The countermeasure experience consisting of the current observation value, the joint action, the global reward value and the observation value at the next moment is then stored in the experience buffer.
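A small sketch of the experience buffer interaction this module describes, storing (observation, joint action, global reward, next observation) tuples and sampling them uniformly; the capacity and field names are assumptions:

```python
import random
from collections import deque

class ExperienceBuffer:
    """Fixed-size replay buffer of countermeasure experiences."""

    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)

    def store(self, obs, joint_action, global_reward, next_obs):
        self.buffer.append((obs, joint_action, global_reward, next_obs))

    def sample(self, batch_size: int):
        return random.sample(self.buffer, batch_size)
```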
The training evaluation module 306 is configured to randomly sample a set of countermeasure experiences from the experience buffer as samples to train the countermeasure network model, and to evaluate the performance of the trained countermeasure network model. The training evaluation module 306 is specifically configured to randomly sample a set of countermeasure experiences from the experience buffer and determine a target Q value using the countermeasure network model based on the sampled countermeasure experiences. In the embodiment of the application, the countermeasure network model is consistent with the deep reinforcement learning algorithm before improvement in having two important components, namely a main network and a target network. The main network is used to train the countermeasure network model, and the target network is used to update the countermeasure network model. Specifically, a set of countermeasure experiences is randomly sampled from the experience buffer and input as samples into the countermeasure network model to determine the target Q value, as follows:
$Q = r_i + \gamma \max_{a \in A} Q(s_{i+1}, a, \varepsilon; \mu, \sigma)$. In the formula, Q represents the target Q value, s_i represents the self state of the unmanned aerial vehicle at moment i, a_i represents the execution action of the unmanned aerial vehicle at moment i, ε represents random noise, μ and σ represent parameters of the main network in the countermeasure network model, r_i represents the immediate reward obtained after executing the execution action at moment i, γ represents the discount factor, and A represents the action set of execution actions of the unmanned aerial vehicle.
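In PyTorch terms, the target value computation might look like the sketch below, assuming the network maps a batch of states to a vector of Q values per action; which network supplies the bootstrap estimate and the handling of terminal states are simplifications here:

```python
import torch

@torch.no_grad()
def compute_target_q(reward, next_state, q_network, gamma: float = 0.99):
    """Target Q = r_i + gamma * max_a Q(s_{i+1}, a) over a batch (terminal masking omitted)."""
    next_q = q_network(next_state)                 # shape: (batch, num_actions)
    return reward + gamma * next_q.max(dim=1).values
```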
The current self state of each unmanned aerial vehicle in the unmanned aerial vehicle cluster is input into the countermeasure network model to obtain the current Q value of the corresponding execution action. Specifically, the current self state of each unmanned aerial vehicle in the three-dimensional countermeasure environment is input into the countermeasure network model, so as to obtain the current Q value of the execution action of that unmanned aerial vehicle.
And determining a loss value according to the target Q value and the current Q value. In the embodiment of the present application, the formula for determining the loss value is as follows:
$L(\mu, \sigma) = \mathbb{E}\left[\left(Q_{target} - Q(s_i, a_i, \varepsilon; \mu, \sigma)\right)^2\right]$, wherein $Q_{target} = r_i + \gamma \max_{a \in A} Q(s_{i+1}, a, \varepsilon'; \mu', \sigma')$. In the formula, L represents the loss value, Q_target represents the target Q value, Q(s_i, a_i, ε; μ, σ) represents the current Q value, s_i represents the self state of the unmanned aerial vehicle at moment i, a_i represents the execution action of the unmanned aerial vehicle at moment i, ε represents random noise, μ and σ represent parameters of the main network in the countermeasure network model, r_i represents the immediate reward obtained after executing the execution action at moment i, γ represents the discount factor, A represents the action set of execution actions of the unmanned aerial vehicle, and μ' and σ' represent parameters of the target network in the countermeasure network model.
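A sketch of the corresponding TD loss in PyTorch, with the target detached so that gradients only flow through the current Q value; the batch layout is an assumption:

```python
import torch
import torch.nn.functional as F

def td_loss(q_network, states, actions, targets):
    """Mean squared error between target Q values and the Q values of the chosen actions."""
    q_all = q_network(states)                                    # (batch, num_actions)
    q_current = q_all.gather(1, actions.unsqueeze(1)).squeeze(1) # Q of the executed actions
    return F.mse_loss(q_current, targets.detach())
```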
Parameters of the countermeasure network model are updated according to the loss value. In the embodiment of the present application, the formula for updating the parameters of the countermeasure network model according to the loss value is as follows:
$(\mu_w, \sigma_w, \mu_b, \sigma_b) \leftarrow (\mu_w, \sigma_w, \mu_b, \sigma_b) + \alpha \delta \left(\nabla_{\mu_w} Q, \nabla_{\sigma_w} Q, \nabla_{\mu_b} Q, \nabla_{\sigma_b} Q\right)$, wherein $\delta = Q_{target} - Q(s_i, a_i, \varepsilon; \mu, \sigma)$. In the formula, (μ_w, σ_w, μ_b, σ_b) represent the parameters of the countermeasure network model that need to be trained, α represents the learning rate of the optimizer, δ represents the difference between the target Q value and the current Q value, ∇_{μ_w}Q, ∇_{σ_w}Q, ∇_{μ_b}Q and ∇_{σ_b}Q represent the gradients of the value function with respect to the parameters μ_w, σ_w, μ_b and σ_b respectively, s_i represents the self state of the unmanned aerial vehicle at moment i, a_i represents the execution action of the unmanned aerial vehicle at moment i, ε represents random noise, and μ and σ represent the parameters of the main network in the countermeasure network model.
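In practice this update can be delegated to a standard optimizer over the noisy weight and bias parameters; the following sketch assumes an Adam optimizer and an illustrative learning rate rather than the patent's settings:

```python
import torch

def update_parameters(q_network, loss, optimizer):
    """One gradient step on the countermeasure network's trainable parameters."""
    optimizer.zero_grad()
    loss.backward()        # gradients w.r.t. the noisy weight/bias parameters
    optimizer.step()

# Example wiring (assumed settings):
# optimizer = torch.optim.Adam(q_network.parameters(), lr=1e-3)
```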
It is judged whether the number of iterations has reached the preset number of training times. Specifically, it is determined whether the number of iterations of the countermeasure network model at this point has reached the preset number of training times. If it has not, iteration continues; through continuous iteration the current Q value gradually approaches the target Q value. Otherwise, the iteration ends and training of the countermeasure network model is completed. Specifically, when the number of iterations of the countermeasure network model reaches the preset number of training times, iteration stops and the countermeasure network model completes training. After the countermeasure network model is trained, when the trained countermeasure network model is applied to perform unmanned aerial vehicle cluster reconnaissance countermeasure tasks, no noise needs to be added, and the random noise ε is set to 0.
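Putting the sampling, target computation, loss and update together with the iteration-count stopping criterion, a simplified training loop under the assumptions above might read as follows; the batch conversion, per-agent action layout and target-network sync interval are all illustrative:

```python
import torch
import torch.nn.functional as F

def train(q_network, target_network, buffer, optimizer,
          num_iterations=10_000, batch_size=64, gamma=0.99, sync_every=200):
    """Iterate until the preset number of training steps is reached."""
    for it in range(num_iterations):
        obs, actions, rewards, next_obs = zip(*buffer.sample(batch_size))
        obs = torch.as_tensor(obs, dtype=torch.float32)
        next_obs = torch.as_tensor(next_obs, dtype=torch.float32)
        actions = torch.as_tensor(actions, dtype=torch.int64)
        rewards = torch.as_tensor(rewards, dtype=torch.float32)

        with torch.no_grad():                       # bootstrap target from the target network
            target = rewards + gamma * target_network(next_obs).max(dim=1).values
        current = q_network(obs).gather(1, actions.unsqueeze(1)).squeeze(1)
        loss = F.mse_loss(current, target)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if it % sync_every == 0:                    # periodically refresh the target network
            target_network.load_state_dict(q_network.state_dict())
```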
Testing is then carried out with the trained countermeasure network model, and the reconnaissance countermeasure performance of the unmanned aerial vehicle clusters of the two parties is evaluated. The success rate, efficiency and antagonism of the two countermeasure parties can be evaluated.
The success rate assessment is carried out by respectively tracking and recording the times of reconnaissance, attack and final win of the unmanned aerial vehicle clusters of the two countermeasures, and the performance of the unmanned aerial vehicle clusters on the reconnaissance task and the countermeasures task is assessed by calculating the success rate.
The efficiency evaluation is to judge whether the unmanned aerial vehicle cluster has advantages in terms of resource utilization by observing the time and resources spent by the unmanned aerial vehicle clusters of the two counterparties in completing tasks.
The antagonism evaluation is to evaluate the performance of the own unmanned aerial vehicle cluster and the opposite unmanned aerial vehicle cluster in interaction under the condition of simulating the antagonism, and judge the application potential and tactical effect of the unmanned aerial vehicle cluster in actual reconnaissance antagonism.
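To make the bookkeeping behind these evaluations concrete, a minimal sketch of how wins, time and resource consumption might be tallied per cluster across confrontation rounds; all field names and the notion of "resources" as drones lost are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class EvaluationStats:
    """Illustrative per-cluster tallies used to compute success rate and efficiency."""
    rounds: int = 0
    wins: int = 0
    total_steps: int = 0      # time spent, in simulation steps
    drones_lost: int = 0      # resources consumed

    def record_round(self, won: bool, steps: int, lost: int) -> None:
        self.rounds += 1
        self.wins += int(won)
        self.total_steps += steps
        self.drones_lost += lost

    @property
    def success_rate(self) -> float:
        return self.wins / self.rounds if self.rounds else 0.0
```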
Some of the modules of the apparatus described herein may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, classes, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
The apparatus or module set forth in the embodiments of the application may be implemented in particular by a computer chip or entity, or by a product having a certain function. For convenience of description, the above devices are described as being functionally divided into various modules, respectively. The functions of the modules may be implemented in the same piece or pieces of software and/or hardware when implementing the embodiments of the present application. Of course, a module that implements a certain function may be implemented by a plurality of sub-modules or a combination of sub-units.
In addition, each functional module in the embodiments of the present invention may be integrated into one processing module, each module may exist alone, or two or more modules may be integrated into one module.
In this specification, each embodiment is described in a progressive manner, and the same or similar parts of each embodiment are referred to each other, and each embodiment is mainly described as a difference from other embodiments. All or portions of the present application can be used in a number of general purpose or special purpose computer system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet devices, mobile communication terminals, multiprocessor systems, microprocessor-based systems, programmable electronic devices, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
The above embodiments are only for illustrating the technical solution of the present application, and not for limiting the present application; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced with equivalents; such modifications and substitutions do not depart from the spirit of the corresponding technical solutions.

Claims (10)

1. The unmanned aerial vehicle cluster investigation countermeasure method based on deep reinforcement learning is characterized by comprising the following steps of:
constructing a countermeasure space comprising a three-dimensional countermeasure environment and two countermeasure parties; wherein the two countermeasure parties each comprise an unmanned aerial vehicle cluster formed by the same number of detectors and attack machines;
constructing an countermeasure network model based on an improved deep reinforcement learning algorithm, and setting an experience buffer zone;
setting a current self state and execution action for the unmanned aerial vehicle in the unmanned aerial vehicle cluster, and setting a global rewarding function of the unmanned aerial vehicle cluster;
acquiring a current observation value of the three-dimensional countermeasure environment observed by the unmanned aerial vehicle cluster, and determining that the execution action of each unmanned aerial vehicle in the unmanned aerial vehicle cluster forms a joint action of the unmanned aerial vehicle cluster;
Executing the combined action in the three-dimensional countermeasure environment, so as to obtain a global rewarding value of the unmanned aerial vehicle cluster and an observation value at the next moment, and storing countermeasure experiences in the experience buffer area; wherein the countermeasures include the current observation, the joint action, the global rewards value and the next time observation;
randomly sampling a set of the challenge experiences in the experience buffer as samples to train the challenge network model, and evaluating the performance of the challenge network model after training.
2. The method of claim 1, wherein the improved deep reinforcement learning algorithm comprises:
reconstructing weights and biases of nodes in the neural network into noise weights and noise biases with noise so as to improve a deep reinforcement learning algorithm; wherein the noise weights and the noise offsets are obtained based on training parameters of the countermeasure network model and random noise.
3. The method according to claim 1, wherein the self-state is as follows:
$s_i^d = (x_i, y_i, z_i, v_{xi}, v_{yi}, v_{zi}, \phi_i, \theta_i, \psi_i, L_i, G_i)$; in the formula, s_i^d represents said self state of an i-th detector in said unmanned aerial vehicle cluster of a countermeasure party, x_i, y_i and z_i respectively represent the positions of the i-th detector on the x-axis, the y-axis and the z-axis, v_xi, v_yi and v_zi respectively represent the speeds of the i-th detector in the x-axis, y-axis and z-axis directions, φ_i represents the roll angle of the i-th detector, θ_i represents the pitch angle of the i-th detector, ψ_i represents the yaw angle of the i-th detector, L_i represents list information of the unmanned aerial vehicle cluster of the opposing party detected by the i-th detector, and G_i represents the survival state and radar switch information of the unmanned aerial vehicles in the unmanned aerial vehicle cluster of the opposing party detected by the i-th detector;
$s_i^a = (x_i, y_i, z_i, v_{xi}, v_{yi}, v_{zi}, \phi_i, \theta_i, \psi_i, L_i, G_i, C)$; in the formula, s_i^a represents said self state of an i-th attack machine in said unmanned aerial vehicle cluster of a countermeasure party, x_i, y_i and z_i respectively represent the positions of the i-th attack machine on the x-axis, the y-axis and the z-axis, v_xi, v_yi and v_zi respectively represent the speeds of the i-th attack machine in the x-axis, y-axis and z-axis directions, φ_i represents the roll angle of the i-th attack machine, θ_i represents the pitch angle of the i-th attack machine, ψ_i represents the yaw angle of the i-th attack machine, L_i represents list information of the unmanned aerial vehicle cluster of the opposing party detected by the i-th attack machine, G_i represents the survival state and radar switch information of the unmanned aerial vehicles in the unmanned aerial vehicle cluster of the opposing party detected by the i-th attack machine, and C represents the number of attacks of the i-th attack machine.
4. The method of claim 1, wherein the setting a global rewards function for the drone cluster comprises:
setting a reward rule in the unmanned aerial vehicle cluster countermeasure process; wherein, the rewarding rule comprises a detecting machine rewarding rule, an attacking machine rewarding rule and a winning or losing rewarding rule;
the global rewards function is determined based on the rewards rule.
5. The method of claim 1, wherein prior to the obtaining the current observations of the three-dimensional challenge environment by the drone cluster, further comprising:
setting countermeasure parameters in the countermeasure process for the unmanned aerial vehicle cluster; wherein the challenge parameters include the number of unmanned aerial vehicles of the challenge parties participating in the challenge, the number of challenge rounds, and the maximum interaction length per round.
6. The method of claim 1, wherein randomly sampling a set of the challenge experiences in the experience buffer as samples to train the challenge network model comprises:
iteratively executing the updating step until the iteration times of the countermeasure network model reach preset training times;
the updating step includes:
randomly sampling a set of said challenge experiences from said experience buffer and determining a target Q value using said challenge network model based on said sampled challenge experiences;
Inputting the current self state of each unmanned aerial vehicle in the unmanned aerial vehicle cluster into the countermeasure network model to obtain a current Q value corresponding to the execution action;
and determining a loss value according to the target Q value and the current Q value, and updating parameters of the countermeasure network model according to the loss value.
7. The method of claim 6, wherein the formula for determining the target Q value is as follows:
$Q = r_i + \gamma \max_{a \in A} Q(s_{i+1}, a, \varepsilon; \mu, \sigma)$; wherein Q represents the target Q value, s_i represents said self state of the unmanned aerial vehicle at moment i, a_i represents said execution action of the unmanned aerial vehicle at moment i, ε represents random noise, μ and σ represent parameters of a main network in the countermeasure network model, r_i represents an instant reward obtained after performing said execution action at moment i, γ represents a discount factor, and A represents the action set of said execution actions of the unmanned aerial vehicle.
8. The method of claim 6, wherein the formula for determining the loss value is as follows:
$L(\mu, \sigma) = \mathbb{E}\left[\left(Q_{target} - Q(s_i, a_i, \varepsilon; \mu, \sigma)\right)^2\right]$, wherein $Q_{target} = r_i + \gamma \max_{a \in A} Q(s_{i+1}, a, \varepsilon'; \mu', \sigma')$; in the formula, L represents the loss value, Q_target represents the target Q value, Q(s_i, a_i, ε; μ, σ) represents the current Q value, s_i represents said self state of the unmanned aerial vehicle at moment i, a_i represents said execution action of the unmanned aerial vehicle at moment i, ε represents the random noise, μ and σ represent parameters of the main network in the countermeasure network model, r_i represents an instant reward obtained after performing said execution action at moment i, γ represents a discount factor, A represents the action set of said execution actions of the unmanned aerial vehicle, and μ' and σ' represent parameters of the target network in the countermeasure network model.
9. The method of claim 6, wherein the formula for updating the parameters of the countermeasure network model based on the loss value is as follows:
$(\mu_w, \sigma_w, \mu_b, \sigma_b) \leftarrow (\mu_w, \sigma_w, \mu_b, \sigma_b) + \alpha \delta \left(\nabla_{\mu_w} Q, \nabla_{\sigma_w} Q, \nabla_{\mu_b} Q, \nabla_{\sigma_b} Q\right)$, wherein $\delta = Q_{target} - Q(s_i, a_i, \varepsilon; \mu, \sigma)$; in the formula, (μ_w, σ_w, μ_b, σ_b) represent the parameters of the countermeasure network model that need to be trained, α represents the learning rate of the optimizer, δ represents the difference between the target Q value and the current Q value, ∇_{μ_w}Q, ∇_{σ_w}Q, ∇_{μ_b}Q and ∇_{σ_b}Q represent the gradients of the value function with respect to the parameters μ_w, σ_w, μ_b and σ_b respectively, s_i represents said self state of the unmanned aerial vehicle at moment i, a_i represents said execution action of the unmanned aerial vehicle at moment i, ε represents random noise, and μ and σ represent parameters of the main network in the countermeasure network model.
10. Unmanned aerial vehicle cluster investigation countermeasure device based on degree of depth reinforcement study, characterized by comprising:
The construction environment module is used for constructing a countermeasure space comprising a three-dimensional countermeasure environment and two countermeasure parties; wherein the two countermeasure parties each comprise an unmanned aerial vehicle cluster formed by the same number of detectors and attack machines;
the model building module is used for building an countermeasure network model based on an improved deep reinforcement learning algorithm and setting an experience buffer area;
the setting module is used for setting the current self state and execution action of the unmanned aerial vehicle in the unmanned aerial vehicle cluster and setting a reward function for the unmanned aerial vehicle cluster;
the determining module is used for obtaining a current observation value of the three-dimensional countermeasure environment observed by the unmanned aerial vehicle cluster and determining that the execution actions of all unmanned aerial vehicles in the unmanned aerial vehicle cluster form a joint action of the unmanned aerial vehicle cluster;
the countermeasure module is used for executing the combined action in the three-dimensional countermeasure environment so as to obtain a reward value of the unmanned aerial vehicle cluster and an observation value at the next moment, and storing countermeasure experience into the experience buffer area; wherein the countermeasures include the current observation, the joint action, the reward value, and the observation at the next time;
the training evaluation module is used for randomly sampling a group of countermeasure experiences in the experience buffer area as samples to train the countermeasure network model, and evaluating the performance of the trained countermeasure network model.
CN202410162415.2A 2024-02-05 2024-02-05 Unmanned aerial vehicle cluster investigation countermeasure method and device based on deep reinforcement learning Pending CN117707219A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410162415.2A CN117707219A (en) 2024-02-05 2024-02-05 Unmanned aerial vehicle cluster investigation countermeasure method and device based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410162415.2A CN117707219A (en) 2024-02-05 2024-02-05 Unmanned aerial vehicle cluster investigation countermeasure method and device based on deep reinforcement learning

Publications (1)

Publication Number Publication Date
CN117707219A true CN117707219A (en) 2024-03-15

Family

ID=90151991

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410162415.2A Pending CN117707219A (en) 2024-02-05 2024-02-05 Unmanned aerial vehicle cluster investigation countermeasure method and device based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN117707219A (en)

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20180054007A (en) * 2016-11-14 2018-05-24 동국대학교 산학협력단 System and method for control of drone
RU2700107C1 (en) * 2018-10-24 2019-09-12 Федеральное государственное казенное военное образовательное учреждение высшего образования "Военная академия материально-технического обеспечения имени генерала армии А.В. Хрулёва" Anti-drones combat system
CN113286275A (en) * 2021-04-23 2021-08-20 南京大学 Unmanned aerial vehicle cluster efficient communication method based on multi-agent reinforcement learning
CN113298368A (en) * 2021-05-14 2021-08-24 南京航空航天大学 Multi-unmanned aerial vehicle task planning method based on deep reinforcement learning
CN113777571A (en) * 2021-08-04 2021-12-10 中山大学 Unmanned aerial vehicle cluster dynamic directional diagram synthesis method based on deep learning
AU2021105033A4 (en) * 2021-08-06 2022-04-14 Richard Kwasi Bannor A self-controlled unmanned aerial vehicle system for farming applications and its working method thereof
CN114460959A (en) * 2021-12-15 2022-05-10 北京机电工程研究所 Unmanned aerial vehicle group cooperative autonomous decision-making method and device based on multi-body game
CN114841055A (en) * 2022-03-31 2022-08-02 西北工业大学 Unmanned aerial vehicle cluster task pre-distribution method based on generation of countermeasure network
CN114840024A (en) * 2022-05-25 2022-08-02 西安电子科技大学 Unmanned aerial vehicle control decision method based on context memory
CN115047912A (en) * 2022-07-14 2022-09-13 北京航空航天大学 Unmanned aerial vehicle cluster self-adaptive self-reconstruction method and system based on reinforcement learning
CN116136945A (en) * 2023-02-28 2023-05-19 沈阳航空航天大学 Unmanned aerial vehicle cluster countermeasure game simulation method based on anti-facts base line
CN116203971A (en) * 2023-05-04 2023-06-02 安徽中科星驰自动驾驶技术有限公司 Unmanned obstacle avoidance method for generating countering network collaborative prediction
CN116859989A (en) * 2023-06-25 2023-10-10 中国电子科技集团公司第五十四研究所 Unmanned aerial vehicle cluster intelligent countermeasure strategy generation method based on group cooperation


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
LIU Qiang; JIANG Feng: "Research on Group Confrontation Strategy Based on Deep Reinforcement Learning", Intelligent Computer and Applications, no. 05, 1 May 2020 (2020-05-01) *
LUO Delin; XU Yang; ZHANG Jinpeng: "New Advances in UAV Swarm Confrontation Technology", Science & Technology Review, no. 07, 13 April 2017 (2017-04-13) *
LUO Hongying; LIU Jinmang: "Anti-Radiation UAVs and Their Countermeasure Technology", Command Control & Simulation, no. 03, 15 June 2009 (2009-06-15) *

Similar Documents

Publication Publication Date Title
CN111260031B (en) Unmanned aerial vehicle cluster target defense method based on deep reinforcement learning
CN112180724B (en) Training method and system for multi-agent cooperative cooperation under interference condition
CN112180967B (en) Multi-unmanned aerial vehicle cooperative countermeasure decision-making method based on evaluation-execution architecture
CN113791634A (en) Multi-aircraft air combat decision method based on multi-agent reinforcement learning
CN105678030A (en) Air-combat tactic team simulating method based on expert system and tactic-military-strategy fractalization
CN113110101B (en) Production line mobile robot gathering type recovery and warehousing simulation method and system
CN116661503B (en) Cluster track automatic planning method based on multi-agent safety reinforcement learning
Sheikh et al. Learning distributed cooperative policies for security games via deep reinforcement learning
CN117311392A (en) Unmanned aerial vehicle group countermeasure control method and system
CN112651486A (en) Method for improving convergence rate of MADDPG algorithm and application thereof
CN117707219A (en) Unmanned aerial vehicle cluster investigation countermeasure method and device based on deep reinforcement learning
CN115900433A (en) Decision method of multi-agent unmanned countermeasure system based on SWOT analysis and behavior tree
CN111723941B (en) Rule generation method and device, electronic equipment and storage medium
Haoran et al. NeuronsGym: A Hybrid Framework and Benchmark for Robot Tasks with Sim2Real Policy Learning
Li et al. Efficiency-reinforced learning with auxiliary depth reconstruction for autonomous navigation of mobile devices
Liang et al. Parallel gym gazebo: a scalable parallel robot deep reinforcement learning platform
Chen The predator-prey evolutionary robots system: from simulation to real world
CN112001583B (en) Strategy determination method, central control equipment and storage medium
CN114254722B (en) Multi-intelligent-model fusion method for game confrontation
CN117313561B (en) Unmanned aerial vehicle intelligent decision model training method and unmanned aerial vehicle intelligent decision method
Jönsson et al. Monte-carlo tree search in continuous action spaces for autonomous racing: F1-tenth
CN115097861B (en) Multi-unmanned aerial vehicle trapping strategy method based on CEL-MADDPG
Bamal Collision-free path finding for dynamic gaming and real time robot navigation
Teja et al. Intelligent Path-Finding Agent in Video Games Using Artificial Intelligence
Chen et al. A Framework for Learning Predator-prey Agents from Simulation to Real World

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination