CN113268081B - Small unmanned aerial vehicle prevention and control command decision method and system based on reinforcement learning

Small unmanned aerial vehicle prevention and control command decision method and system based on reinforcement learning

Info

Publication number
CN113268081B
CN113268081B (application CN202110602580.1A)
Authority
CN
China
Prior art keywords
unmanned aerial
aerial vehicle
prevention
value
small unmanned
Prior art date
Legal status
Active
Application number
CN202110602580.1A
Other languages
Chinese (zh)
Other versions
CN113268081A
Inventor
刘阳
温志津
牛余凯
晋晓曦
李晋徽
Current Assignee
32802 Troops Of People's Liberation Army Of China
Original Assignee
32802 Troops Of People's Liberation Army Of China
Priority date
Filing date
Publication date
Application filed by 32802 Troops Of People's Liberation Army Of China filed Critical 32802 Troops Of People's Liberation Army Of China
Priority to CN202110602580.1A
Publication of CN113268081A
Application granted
Publication of CN113268081B
Legal status: Active

Classifications

    • G - PHYSICS
    • G05 - CONTROLLING; REGULATING
    • G05D - SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D 1/00 - Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D 1/10 - Simultaneous control of position or course in three dimensions
    • G05D 1/101 - Simultaneous control of position or course in three dimensions specially adapted for aircraft

Landscapes

  • Engineering & Computer Science (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)

Abstract

The invention discloses a small unmanned aerial vehicle prevention and control command decision method based on reinforcement learning, which comprises the following steps: determining the composition of a small unmanned aerial vehicle prevention and control system, wherein the small unmanned aerial vehicle prevention and control system comprises a detection subsystem, a disposal subsystem and a command control system, the detection subsystem is used for providing combat situation information and the disposal subsystem is responsible for implementing prevention and control disposal; establishing a three-degree-of-freedom particle motion model of the small unmanned aerial vehicle; constructing a prevention and control command decision model; training and optimizing the small unmanned aerial vehicle prevention and control command decision model; and verifying and evaluating the prevention and control effect of the prevention and control command decision model. The invention also discloses a small unmanned aerial vehicle prevention and control command decision system based on reinforcement learning, which comprises a multi-source data fusion module, a situation analysis module, a prevention and control planning module and an effect evaluation module. The invention solves the problems of low decision speed and difficulty in handling complex scenes in existing prevention and control command decision systems, and can be widely applied to small unmanned aerial vehicle management and control, civil supervision and military defense.

Description

Small unmanned aerial vehicle prevention and control command decision method and system based on reinforcement learning
Technical Field
The invention belongs to the technical field of command control, and particularly relates to a small unmanned aerial vehicle prevention and control command decision method and system based on reinforcement learning.
Background
At present, many mature technologies and achievements exist at home and abroad for the detection and handling of 'low, slow and small' unmanned aerial vehicles. However, for problems such as how to generate a specific disposal strategy from detection information and how to construct a small unmanned aerial vehicle prevention and control command decision system, a commander still needs to make decisions manually, and an operator then completes the corresponding disposal instructions for the unmanned aerial vehicle according to the decision result.
Considering the current level of intelligent technology in command control systems, the existing small unmanned aerial vehicle prevention and control command control systems mainly have the following problems: (1) at present, the prevention and control of small unmanned aerial vehicles is mainly completed manually by operators, and the degree of command automation is extremely low; (2) small unmanned aerial vehicle prevention and control belongs to short-range defense, the command decision time is short and the required response speed is high, so the response time of manual operation can hardly meet the defense requirements, and the gap is even more obvious when coping with multiple targets; (3) the situation faced by small unmanned aerial vehicle prevention and control is complex and changeable, and existing control systems and processes based on empirical rules can hardly adapt to the prevention and control requirements. No existing product or small unmanned aerial vehicle prevention and control command decision system applies a command decision method based on an algorithm model trained by reinforcement learning.
Disclosure of Invention
Aiming at the problem of automatically generating the whole-process strategy of detection, analysis, prevention and control command control, scheduling and handling for low-altitude targets such as small unmanned aerial vehicles in complex scenes such as cities, the invention discloses a small unmanned aerial vehicle prevention and control command decision method and system based on reinforcement learning. The method and system realize the efficient conversion of comprehensive situation data for small unmanned aerial vehicle prevention and control into prevention and control disposal schemes and instructions for the unmanned aerial vehicle, can access multi-source detection means and multiple disposal means for command decision, effectively promote the intelligent decision level in the 4 stages of the small unmanned aerial vehicle prevention and control command flow (situation fusion, threat analysis, scheme planning and disposal control), solve the problems of low decision speed and difficulty in handling complex scenes in existing prevention and control command systems, and meet the prevention and control requirements for small unmanned aerial vehicles. A small unmanned aerial vehicle generally refers to an unmanned aerial vehicle with a takeoff weight of not more than 25 kilograms, including both fixed-wing and rotary-wing types, and has characteristics such as low cost and strong maneuverability.
The invention discloses a small unmanned aerial vehicle prevention and control command decision method based on reinforcement learning, which comprises the following steps:
s1, determining the composition of a small unmanned aerial vehicle prevention and control system;
s2, establishing a three-degree-of-freedom particle motion model of the small unmanned aerial vehicle;
s3, constructing a small unmanned aerial vehicle prevention and control command decision model;
s4, training and optimizing a small unmanned aerial vehicle prevention and control command decision model;
and S5, verifying and evaluating the prevention and control effect of the small unmanned aerial vehicle prevention and control command decision model.
Further, the step S1 specifically includes: determining the composition of a small unmanned aerial vehicle prevention and control system, wherein the small unmanned aerial vehicle prevention and control system comprises a detection subsystem, a disposal subsystem and a command control system; the detection subsystem is used for providing combat situation information, the disposal subsystem is responsible for implementing prevention and control disposal, and the command control system is used for receiving the combat situation information from the detection subsystem and scheduling multiple disposal means to generate a disposal strategy; the detection subsystem comprises single-type or multi-type detection equipment, and the disposal subsystem comprises multiple types of soft-kill disposal equipment and hard-interception disposal equipment; the command control system comprises a multi-source data fusion module, a situation analysis module, a prevention and control planning module and an effect evaluation module;
Specifically, the detection subsystem comprises radar detection equipment, photoelectric detection equipment and radio detection equipment, and the disposal subsystem comprises radio interference equipment and laser interception equipment;
further, the step S2 specifically includes: in the unmanned aerial vehicle prevention and control operation, mainly prevent and control the processing according to information such as the target position that the subsystem obtained of surveying, speed, consequently the key is the model of research prevention and control object in the prevention and control operation, regards unmanned aerial vehicle as the particle, establishes its three degree of freedom particle motion model:
dx/dt = v·cosθ·cosψ
dy/dt = v·cosθ·sinψ
dz/dt = v·sinθ
wherein (x, y, z) represents the coordinates of the small unmanned aerial vehicle in a three-dimensional space coordinate system of the earth, v, theta and psi respectively represent the flight speed, the pitch angle and the yaw angle of the small unmanned aerial vehicle, and t represents time.
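As an illustration only, the particle model above can be stepped forward in simulation with a simple Euler integration of the three kinematic equations; the following Python sketch assumes this form of the model, and the function and parameter names are illustrative rather than taken from the patent.

```python
import math

def step_drone_state(x, y, z, v, theta, psi, dt):
    """Advance the three-degree-of-freedom particle model by one Euler step.

    (x, y, z): position in the ground coordinate system
    v:         flight speed
    theta:     pitch angle (rad)
    psi:       yaw angle (rad)
    dt:        integration time step (s)
    """
    x += v * math.cos(theta) * math.cos(psi) * dt
    y += v * math.cos(theta) * math.sin(psi) * dt
    z += v * math.sin(theta) * dt
    return x, y, z

# Example: a drone cruising at 15 m/s in level flight, heading 45 degrees
pos = step_drone_state(0.0, 0.0, 100.0, 15.0, 0.0, math.pi / 4, 0.1)
print(pos)
```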
Further, the step S3 specifically includes: the disposal equipment of the unmanned aerial vehicle prevention and control system comprises laser interception equipment and radio interference equipment; the actions of the laser equipment comprise four actions, namely turning on the laser equipment, turning off the laser equipment, keeping the equipment state and adjusting the laser pointing direction, and the actions of the radio interference equipment likewise comprise four actions, namely turning on interference, turning off interference, keeping the current action and adjusting the interference pointing direction. The actions of the disposal equipment are coded with a three-bit binary number, wherein the first bit represents the type of equipment and the last two bits represent the specific action of that equipment; that is, the action taken by the disposal equipment of the prevention and control system is represented by a triple formed by the three-bit binary number.
According to the characteristics of the small unmanned aerial vehicle prevention and control task and the Markov decision process, a small unmanned aerial vehicle prevention and control command decision model is established, a state space and a disposal decision space are designed, and a reward function is determined according to the prevention and control intention of a small unmanned aerial vehicle prevention and control system;
the small unmanned aerial vehicle control command decision model is established by adopting a reinforcement learning algorithm, interaction between the intelligent decision model and the environment is described by adopting a Markov decision process in reinforcement learning, and the Markov decision process is realized by utilizing a state space, an action space, a reward function and a discount coefficient;
the expression of the state space S of the unmanned aerial vehicle prevention and control command decision model is as follows:
S=[dt,vt,θt,ψt,,tl,tj],
wherein d istThe expression of (a) is:
Figure BDA0003093374690000032
Figure BDA0003093374690000041
wherein the content of the first and second substances,
Figure BDA0003093374690000042
and
Figure BDA0003093374690000043
respectively representing the position coordinates of the small unmanned aerial vehicle at the time t and the time t-delta t, (x)a,ya,za) Representing the position coordinates of the detection device, at representing the stepping time interval of the Markov decision process; dtThe distance between the small unmanned aerial vehicle and the detection equipment at the moment t is represented; v. oftRepresenting the flight speed of the small unmanned aerial vehicle at the moment t; t is tlRepresenting the light emitting time of the laser interception equipment; t is tjRepresenting the time when the radio interference device is on; theta and psi are denoted as the pitch angle and yaw angle of the drone, respectively.
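A minimal sketch of how the state vector S could be assembled from two consecutive position measurements is given below; the NumPy-based helper and its argument names are assumptions introduced for illustration, not part of the patent text.

```python
import numpy as np

def build_state(p_t, p_prev, p_detector, theta_t, psi_t, t_laser, t_jam, delta_t):
    """Assemble S = [d_t, v_t, theta_t, psi_t, t_l, t_j].

    p_t, p_prev: drone positions at time t and t - delta_t (3-vectors)
    p_detector:  position of the detection equipment (3-vector)
    t_laser:     accumulated emission time of the laser interception equipment
    t_jam:       accumulated on-time of the radio interference equipment
    """
    p_t, p_prev, p_detector = map(np.asarray, (p_t, p_prev, p_detector))
    d_t = np.linalg.norm(p_t - p_detector)        # drone-to-detector distance
    v_t = np.linalg.norm(p_t - p_prev) / delta_t  # speed estimated from displacement
    return np.array([d_t, v_t, theta_t, psi_t, t_laser, t_jam], dtype=np.float32)

s = build_state([120.0, 80.0, 60.0], [118.5, 79.0, 60.0], [0.0, 0.0, 5.0],
                0.0, 0.8, 0.0, 2.0, delta_t=0.1)
print(s.shape)  # (6,)
```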
The expression of the action space A of the unmanned aerial vehicle prevention and control command decision model is A = [Dt, Da1, Da2], wherein the device type Dt takes the value 0 or 1, and the action type of the equipment is determined by the combination of the action variables Da1 and Da2; the specific values of the action variables [Da1, Da2] include the four combinations 00, 01, 10 and 11.
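The three-bit action coding can be illustrated with the following sketch, which decodes a triple [Dt, Da1, Da2] into a device and an action; the mapping of the two action bits to concrete actions is one plausible assignment and is only an assumption.

```python
# Hypothetical mapping of the two action bits to concrete actions; the patent
# does not fix which two-bit pattern corresponds to which action.
DEVICES = {0: "laser interception equipment", 1: "radio interference equipment"}
ACTIONS = {
    (0, 0): "turn on",
    (0, 1): "turn off",
    (1, 0): "hold current state",
    (1, 1): "adjust pointing direction",
}

def decode_action(triple):
    """Decode an action triple [Dt, Da1, Da2] into (device, action)."""
    d_t, d_a1, d_a2 = triple
    return DEVICES[d_t], ACTIONS[(d_a1, d_a2)]

# Enumerating all triples gives a discrete action space of 2 devices x 4 actions = 8 actions.
ACTION_SPACE = [(d, a1, a2) for d in (0, 1) for a1 in (0, 1) for a2 in (0, 1)]

print(decode_action((1, 1, 1)))  # ('radio interference equipment', 'adjust pointing direction')
print(len(ACTION_SPACE))         # 8
```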
When the prevention and control intention of the small unmanned aerial vehicle prevention and control system is to prevent and control a target at medium and long range, the defense success condition at that moment is expressed by a reward function over each flight component of the small unmanned aerial vehicle:
(Piecewise expressions for the angle reward Ra, the distance reward Rd and the speed reward Rv.)
wherein Ra, Rd and Rv respectively represent the angle reward function, the distance reward function and the speed reward function; q represents the included angle between the speed vector of the small unmanned aerial vehicle and the line connecting the small unmanned aerial vehicle and the detection equipment; qm represents the angle value at which the angle reward takes its minimum positive value; two reward coefficients respectively represent the reward value when the detection equipment is within the line-of-sight angle range of the unmanned aerial vehicle and the reward value when it is outside that range; when the angle q is 0, the angle reward value is minimum, and when the angle q is π, the angle reward value is maximum. The distance reward function is expressed by a linear function of the distance, k is a smoothing coefficient that keeps the distance reward function at its minimum positive reward value, and df and dc respectively represent the maximum radius of the small unmanned aerial vehicle prevention and control area and the minimum detection distance of the detection equipment; two further reward coefficients respectively correspond to the cases in which the flying speed of the small unmanned aerial vehicle is lower than a certain flying speed threshold and higher than the maximum flying speed threshold; vmin, vmax and vxh respectively represent the minimum flying speed, the maximum flying speed and the cruising flying speed of the small unmanned aerial vehicle.
Ra, Rd and Rv are weighted and summed to obtain the expression of the reward function R of the small unmanned aerial vehicle prevention and control command decision model, specifically:
R = a1·Ra + a2·Rd + a3·Rv,
wherein a1, a2 and a3 are the weights corresponding to the angle reward function, the distance reward function and the speed reward function, can be obtained from empirical values, and satisfy the constraint conditions a1 + a2 + a3 = 1 and a1, a2, a3 ≥ 0.
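Since only the qualitative shape of Ra, Rd and Rv is specified above, the sketch below combines illustrative piecewise component functions with the weighted sum R = a1·Ra + a2·Rd + a3·Rv; the concrete component formulas and all constants are assumptions for illustration, not the patent's expressions.

```python
import math

def angle_reward(q, q_m=math.pi / 3, r_in=0.1, r_out=1.0):
    """Illustrative angle reward: smallest at q = 0, largest at q = pi.
    q_m is the angle of the minimum positive reward; r_in / r_out are assumed
    coefficients inside / outside the line-of-sight angle range."""
    coeff = r_in if q < q_m else r_out
    return coeff * (1.0 - math.cos(q)) / 2.0

def distance_reward(d, d_c=500.0, d_f=5000.0, k=0.05):
    """Illustrative linear distance reward between the minimum detection
    distance d_c and the maximum prevention-and-control radius d_f."""
    if d <= d_c:
        return 1.0
    if d >= d_f:
        return k  # smoothing coefficient keeps a minimum positive reward
    return 1.0 - (1.0 - k) * (d - d_c) / (d_f - d_c)

def speed_reward(v, v_min=5.0, v_max=40.0, c_low=0.2, c_high=0.2):
    """Illustrative speed reward penalising speeds below v_min or above v_max."""
    if v < v_min:
        return c_low
    if v > v_max:
        return c_high
    return 1.0

def total_reward(q, d, v, a1=0.4, a2=0.4, a3=0.2):
    """Weighted sum R = a1*Ra + a2*Rd + a3*Rv with a1 + a2 + a3 = 1."""
    assert abs(a1 + a2 + a3 - 1.0) < 1e-9 and min(a1, a2, a3) >= 0
    return a1 * angle_reward(q) + a2 * distance_reward(d) + a3 * speed_reward(v)

print(round(total_reward(q=math.pi / 2, d=2000.0, v=20.0), 3))
```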
Further, the step S4 specifically includes: training the small unmanned aerial vehicle prevention and control command decision model with the Deep Q Network algorithm (DQN algorithm for short) until the model can generate prevention and control disposal strategies for driving away or destroying small unmanned aerial vehicles executing different tasks (such as strike and reconnaissance); when the defense success rate of the strategy exceeds a certain threshold, training is stopped and the neural network model parameters at that moment are stored, completing the training and optimization of the small unmanned aerial vehicle prevention and control command decision model.
In the DQN algorithm, a value evaluation network and a value target network are constructed. The output value of the value evaluation network is denoted Q(s, a | θ); its input is the disposal action variable a taken at the previous moment and the state variable s at the current moment, its output determines the disposal action variable to be taken at the next moment, and the corresponding value evaluation network parameter is θ. The value evaluation network parameter θ is updated and optimized by minimizing the difference between the state-action value of the value evaluation network and the state-action value of the value target network, and the Q(s, a | θ) value is output directly by the value evaluation network. The output value of the value target network is denoted Q′(s, a | θ⁻); its input is the disposal action variable a taken at the previous moment and the state variable s at the current moment, and the corresponding value target network parameter is θ⁻. The training target yj is constructed from the value output by the value target network and the reward rj, and the loss function is the least-squares error between this target and the prediction of the value evaluation network; the specific expressions are as follows:
yj = rj + γ·max(aj+1) Q′(sj+1, aj+1 | θ⁻),
L(θ) = E[(yj - Q(sj, aj | θ))²],
wherein the subscript j indicates the j-th data sample taken from the experience pool; rj indicates the reward corresponding to the j-th data, sj the state variable corresponding to the j-th data, aj the disposal action variable of the j-th data, sj+1 the state variable corresponding to the (j+1)-th data, and aj+1 the disposal action variable corresponding to the (j+1)-th data; γ is the reward discount coefficient; L(θ) represents the loss function used in training the value evaluation network with parameter θ; and max(aj+1) Q′(sj+1, aj+1 | θ⁻) represents the maximum value of the value target network output obtained by taking action aj+1 in the state sj+1.
For the value evaluation network, the parameter θ is updated in the direction that increases the output value of the value evaluation network; the process is expressed as:
θ ← θ - α·∇θ L(θ),
where α is the learning rate, ∇θ Q(sj, aj | θ) represents the gradient of the Q-value function with respect to the parameter θ for the state variable sj and action variable aj, and ∇θ L(θ) represents the gradient of the loss function L(θ) with respect to the parameter θ; minimizing L(θ) drives Q(sj, aj | θ) toward the target yj along the gradient ∇θ Q(sj, aj | θ). By temporarily freezing the value target network parameters, the value target network parameters are updated only after the value evaluation network has been trained for a certain period; at that point the value evaluation network parameter θ is copied to the value target network parameter θ⁻, which keeps the value target network fixed in stages and improves the stability of algorithm training;
the value target network and the value evaluation network both adopt a neural network architecture composed of fully connected layers; each network has 3 fully connected layers, with 200, 100 and 50 neurons respectively.
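The following PyTorch sketch shows a value network with the stated 200/100/50 fully connected layers together with one DQN update step (target computation, least-squares loss, and periodic copying of θ to θ⁻); the 6-dimensional state input, the 8-action output and the hyperparameter values are assumptions for illustration.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Fully connected value network: 3 hidden layers with 200, 100 and 50 neurons."""
    def __init__(self, state_dim=6, n_actions=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 200), nn.ReLU(),
            nn.Linear(200, 100), nn.ReLU(),
            nn.Linear(100, 50), nn.ReLU(),
            nn.Linear(50, n_actions),
        )

    def forward(self, s):
        return self.net(s)

q_eval = QNetwork()                      # value evaluation network, parameters theta
q_target = QNetwork()                    # value target network, parameters theta^-
q_target.load_state_dict(q_eval.state_dict())
optimizer = torch.optim.Adam(q_eval.parameters(), lr=1e-3)
gamma = 0.99                             # reward discount coefficient

def dqn_update(batch, step, sync_every=1000):
    """One DQN update of theta from a minibatch sampled from the experience pool."""
    s, a, r, s_next, done = batch        # states, actions (long), rewards, next states, done flags
    with torch.no_grad():                # y_j = r_j + gamma * max_a' Q'(s_{j+1}, a' | theta^-)
        y = r + gamma * (1.0 - done) * q_target(s_next).max(dim=1).values
    q_sa = q_eval(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(q_sa, y)   # L(theta): least-squares error to the target
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % sync_every == 0:           # periodic copy: theta^- <- theta (frozen in between)
        q_target.load_state_dict(q_eval.state_dict())
    return loss.item()
```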
Further, the step S5 specifically includes: loading the small unmanned aerial vehicle prevention and control command decision model obtained by the training of step S4 in an actual small unmanned aerial vehicle prevention and control scene; making a decision according to the state space obtained in real time from the actual scene to obtain a disposal action variable a; and applying the disposal action variable a to the actual scene, thereby obtaining a small unmanned aerial vehicle prevention and control strategy in real time, changing the environment state and obtaining real-time reward feedback.
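A sketch of this deployment loop is given below, assuming a hypothetical environment object that exposes the real-time state space and accepts a disposal action variable; the interface names are not taken from the patent.

```python
import torch

def run_prevention_and_control(env, q_eval, max_steps=1000):
    """Run the trained decision model in an actual prevention and control scene.

    env is assumed to provide reset() and step(action) returning the
    6-dimensional state, the real-time reward feedback, a done flag and an
    info dictionary; q_eval is the trained value evaluation network.
    """
    s = env.reset()
    for _ in range(max_steps):
        with torch.no_grad():
            q_values = q_eval(torch.as_tensor(s, dtype=torch.float32).unsqueeze(0))
        a = int(q_values.argmax(dim=1))    # disposal action variable a
        s, reward, done, _ = env.step(a)   # apply a to the scene: new state, real-time reward feedback
        if done:
            break
```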
The invention discloses a small unmanned aerial vehicle prevention and control command decision system based on reinforcement learning, which comprises a multi-source data fusion module, a situation analysis module, a prevention and control planning module and an effect evaluation module, wherein the four modules are sequentially connected;
the multi-source data fusion module is used for fusing data acquired by detecting the prevention and control environment and the target by the multi-type detection equipment;
the situation analysis module is used for performing attribute analysis and judgment and threat assessment on multi-source target data obtained by the multi-type detection equipment;
the control planning module is used for realizing the small unmanned aerial vehicle control decision method based on reinforcement learning to obtain a small unmanned aerial vehicle control command decision model, and automatically generating a small unmanned aerial vehicle control disposal decision scheme according to threat judgment information obtained by the situation analysis module;
the effect evaluation module analyzes and processes the real-time prevention and control environment situation, the damage degree of the prevention and control target and the specific striking effect of the prevention and control disposal equipment, evaluates the prevention and control effect of the prevention and control disposal decision scheme of the small unmanned aerial vehicle, and provides real-time feedback for the prevention and control command decision action of the unmanned aerial vehicle.
Further, the multi-source data fusion module extracts, manages and organizes information of data obtained by the multi-type detection equipment according to the prevention and control target type, the prevention and control environment elements, the prevention and control target elements, the disposal elements and the like;
Furthermore, the situation analysis module performs attribute analysis and judgment on the multi-source target data throughout the whole prevention and control judgment process, constructs a threat level model for threat assessment, obtains threat judgment information used to grasp the threat degree of the relevant targets, and uploads the threat judgment information to the prevention and control planning module.
Compared with the prior art, the invention has the beneficial effects that:
(1) the invention provides a small unmanned aerial vehicle prevention and control command decision method and a system based on reinforcement learning, wherein a reinforcement learning theory is combined with a small unmanned aerial vehicle prevention and control decision model, so that the automatic generation of comprehensive situation data for the prevention and control of a small unmanned aerial vehicle is realized, and a prevention and control disposal scheme and instructions for the unmanned aerial vehicle are efficiently generated by utilizing the data;
(2) the invention provides a small unmanned aerial vehicle prevention and control command decision method and system based on reinforcement learning, which realize situation fusion, threat analysis, planning scheme and treatment control, and improve the intelligent decision level of 4 unmanned aerial vehicle prevention and control command flow stages, solve the problems of low decision speed, difficulty in processing complex scenes and the like in the conventional prevention and control command decision system, and provide a new technical thought for small unmanned aerial vehicle prevention and control command decision.
(3) The invention provides a method and a system for small unmanned aerial vehicle prevention and control command decision based on reinforcement learning, which can be widely applied to small unmanned aerial vehicle management and control, civil supervision and military defense.
Drawings
Fig. 1 is a flow chart of a control command decision method of a small unmanned aerial vehicle based on reinforcement learning according to the invention;
FIG. 2 is a flow chart of a deep Q network algorithm in the present invention;
fig. 3 is a composition diagram of a small unmanned aerial vehicle prevention and control command decision system based on reinforcement learning.
Detailed Description
For a better understanding of the present disclosure, an example is given here.
In order to facilitate understanding of those skilled in the art, the method and system for unmanned aerial vehicle prevention and control command decision based on reinforcement learning provided by the invention are further described in detail with reference to the accompanying drawings and specific embodiments.
Referring to fig. 1, the small unmanned aerial vehicle prevention and control command decision method based on reinforcement learning of the present invention includes the following steps:
step 1, defining the composition of a small unmanned aerial vehicle prevention and control system. Determining the composition of a small unmanned aerial vehicle prevention and control system, wherein the small unmanned aerial vehicle prevention and control system comprises a detection subsystem, a disposal subsystem and a command control system; the system comprises a detection subsystem, a disposal subsystem and a command control system, wherein the detection subsystem is used for providing combat situation information, the disposal subsystem is responsible for implementing prevention and control disposal, and the command control system is used for receiving the combat situation information and generating a disposal strategy; the detection subsystem comprises radar detection equipment, photoelectric detection equipment and radio detection equipment, the treatment subsystem comprises radio interference equipment and laser interception equipment, and the command control system comprises a data fusion module, a situation analysis module, a prevention and control planning module and an effect evaluation module;
under the condition that the small unmanned aerial vehicle prevention and control system is considered to be composed of 1 set of detection subsystem, 1 set of treatment subsystem and an instruction control system, the detection subsystem comprises 1 station of each of radar, photoelectric detection equipment and radio detection equipment, and the treatment subsystem comprises 1 station of each of radio interference equipment and laser interception equipment. The command control system is composed of data fusion, situation analysis, prevention and control planning and effect evaluation modules.
And 2, constructing a three-degree-of-freedom particle motion model of the small unmanned aerial vehicle. In small unmanned aerial vehicle prevention and control operations, prevention and control disposal is mainly carried out according to information such as the target positions and speeds acquired by the detection subsystem, so the important point is to study the model of the prevention and control target in the prevention and control operation; the target is regarded as a particle and its three-degree-of-freedom particle model is studied:
dx/dt = v·cosθ·cosψ
dy/dt = v·cosθ·sinψ
dz/dt = v·sinθ
wherein (x, y, z) represents the coordinate of the small unmanned aerial vehicle in a three-dimensional space with the ground as a reference system, and v, theta and psi respectively represent the flight speed, the pitch angle and the yaw angle of the small unmanned aerial vehicle.
In this embodiment, it is assumed that N drones executing reconnaissance and strike tasks are initialized randomly outside the protection area where the drone prevention and control system is located, and the coordinate information of the drones is (xi, yi, zi), i = 1…N.
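A small sketch of this random initialization, assuming a circular prevention and control area of radius d_f centred at the origin; the radius value and the sampling scheme are illustrative assumptions.

```python
import numpy as np

def init_drones(n, d_f=5000.0, z_range=(50.0, 500.0), margin=2000.0, seed=0):
    """Randomly place N drones outside a prevention and control area of radius d_f."""
    rng = np.random.default_rng(seed)
    bearing = rng.uniform(0.0, 2.0 * np.pi, size=n)
    radius = d_f + rng.uniform(0.0, margin, size=n)  # strictly outside the protected area
    x = radius * np.cos(bearing)
    y = radius * np.sin(bearing)
    z = rng.uniform(z_range[0], z_range[1], size=n)
    return np.stack([x, y, z], axis=1)               # rows are (x_i, y_i, z_i), i = 1..N

positions = init_drones(n=5)
print(positions.shape)  # (5, 3)
```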
And 3, constructing a small unmanned aerial vehicle prevention and control command decision model. Establishing a small unmanned aerial vehicle prevention and control command decision model according to the small unmanned aerial vehicle prevention and control task characteristics and the Markov decision process, designing a state space and a disposal decision space, and determining a reward function according to the intentions of different targets to be prevented and controlled;
in the invention, the small unmanned aerial vehicle prevention and control command decision model is established by a model-free reinforcement learning algorithm, so that other elements except the state transition probability are only considered.
Wherein, the state space S of the unmanned aerial vehicle prevention and control command decision model is as follows:
S = [dt, vt, θt, ψt, tl, tj],
wherein the expression of dt is:
dt = √[(xb^t - xa)² + (yb^t - ya)² + (zb^t - za)²],
vt = √[(xb^t - xb^(t-Δt))² + (yb^t - yb^(t-Δt))² + (zb^t - zb^(t-Δt))²] / Δt,
wherein (xa, ya, za) represents the radar coordinates and (xb, yb, zb) represents the coordinates of the small unmanned aerial vehicle; the superscripts t and t-Δt respectively denote the positions of the unmanned aerial vehicle at the moment t and at the previous moment; Δt represents the simulation step time interval; dt represents the distance of the unmanned aerial vehicle from the radar; vt represents the flight speed of the unmanned aerial vehicle; tl represents the light emitting time of the laser interception equipment; tj represents the interference time of the radio interference equipment; θ and ψ denote the pitch angle and yaw angle of the unmanned aerial vehicle, respectively.
Wherein, the action space of the unmanned aerial vehicle prevention and control command decision model is A = [Dt, Da1, Da2]; the device type Dt takes the value 0 or 1, and the specific action value [Da1, Da2] includes the four combinations 00, 01, 10 and 11.
The treatment equipment for preventing and controlling the small unmanned aerial vehicle comprises laser interception equipment and radio interference equipment, wherein actions of the laser equipment comprise four actions of opening the laser equipment, closing the laser equipment, keeping the equipment state and adjusting laser pointing direction, the radio interference equipment is basically the same, and the actions comprise four actions of opening interference, closing interference, keeping action and adjusting interference pointing direction.
And the actions are coded by adopting three-digit binary numbers, wherein the first digit represents the type of the equipment, and the last two digits represent the specific actions corresponding to the equipment, namely the action taken by the prevention and control system is represented by a triple.
The specific content of the reward function R of the unmanned aerial vehicle prevention and control command decision model is as follows:
when the intention of the defense and control system is to defend the medium-long distance target, the defense success condition is
Figure BDA0003093374690000162
Wherein R isa、RdAnd RvRespectively representing an angle reward function, a distance reward function and a speed reward function; q represents an included angle between the velocity vector and a connecting line of the unmanned aerial vehicle and the radar; q. q.smRepresenting a critical point angle; when the relative angle q is 0 degrees, the punishment is maximum; when q is 180 °, the penalty is minimal. The distance reward is expressed by a linear function related to the distance, k is a smoothing coefficient of the retention function at a critical point, dfAnd dlRespectively representing the maximum radius of the protective area and the radius of the core area; v. ofmin,vmax,vxhRespectively representing the minimum speed, the maximum speed and the cruising speed of the drone targets.
R is to bea,RdAnd RvAnd weighting to obtain a comprehensive single-step reward R:
R=a1·Ra+a2·Rd+a3·Rv
wherein, a1,a2,a3The weight corresponding to each reward function can be obtained according to the empirical value and satisfies the following constraint a1+a2+a3=1(a1,a2,a3≥0)
And 4, training and optimizing the prevention and control command decision model. The unmanned aerial vehicle prevention and control command decision model is trained with the Deep Q Network algorithm until the decision model can effectively generate prevention and control strategies for unmanned aerial vehicles with different intents; when the defense success rate of the strategy exceeds a certain threshold, the neural network corresponding to the model is obtained.
The DQN algorithm applies the techniques of experience replay and a fixed target network, and is one of the more popular deep reinforcement learning algorithms; its schematic diagram is shown in fig. 2. A value evaluation network and a value target network are constructed; the output of the value evaluation network is represented as Q(s, a | θ) with corresponding parameter θ, and the output of the value target network is represented as Q′(s, a | θ⁻) with corresponding parameter θ⁻. For the value evaluation network, the input is the action a taken at the previous moment and the state s at the current moment, and the output is Q(s, a); the value evaluation network parameter θ is updated and optimized by minimizing the difference between the evaluation network state-action value and the target network state-action value, where the Q value of the evaluation network is output directly by that network and the Q′ value of the target network is output directly by the target network. The training target is constructed from the value output by the target network and the reward rj, specifically as shown in the following formulas:
yj = rj + γ·max(aj+1) Q′(sj+1, aj+1 | θ⁻),
L(θ) = E[(yj - Q(sj, aj | θ))²],
wherein the subscript j represents the index of the j-th data sample taken from the experience pool; γ is the reward discount coefficient; L(θ) represents the loss function for training the evaluation network.
For the evaluation network, the input is the current environment state s and the output is the action a; the parameter θ of the network is updated in the direction that increases the output value of the evaluation network, as shown in the following formula:
θ ← θ - α·∇θ L(θ),
where α is the learning rate. The parameters of the target network are temporarily frozen and are updated only every time a certain number of training steps is reached, θ⁻ ← θ.
The small unmanned aerial vehicle prevention and control command decision model is trained with the DQN algorithm, specifically programmed in Python 3.8 with the PyTorch deep learning framework; the target network and the evaluation network both adopt a neural network architecture formed by fully connected layers, with 3 fully connected layers in total and 200, 100 and 50 neurons respectively; the upper limit of training is set to 10000 rounds, and the maximum number of steps per round is set to 10^5. When the defense success rate of the strategy exceeds a certain threshold, specifically when 270 or more successful rounds are reached in every 300 training rounds, training is stopped and the neural network model parameters at that moment are stored.
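The stopping rule described above (at least 270 successful defenses in every 300 training rounds, with at most 10000 rounds of 10^5 steps) can be sketched as follows; the environment and agent interfaces are hypothetical.

```python
from collections import deque

def train_decision_model(env, agent, max_rounds=10000, max_steps=10**5,
                         window=300, success_threshold=270):
    """Train until the defense success rate reaches the threshold, then save the model."""
    recent = deque(maxlen=window)                   # success flags of the last 300 rounds
    for round_idx in range(max_rounds):
        s = env.reset()
        success = False
        for _ in range(max_steps):
            a = agent.act(s)                        # action from the evaluation network (e.g. epsilon-greedy)
            s_next, r, done, info = env.step(a)
            agent.store(s, a, r, s_next, done)      # experience pool
            agent.learn()                           # one DQN update (cf. the update sketch above)
            s = s_next
            if done:
                success = info.get("defense_success", False)
                break
        recent.append(success)
        if len(recent) == window and sum(recent) >= success_threshold:
            agent.save("uav_control_dqn.pt")        # store the neural network model parameters
            return round_idx
    return max_rounds
```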
And 5, verifying and evaluating the effect of the decision model. The prevention and control command decision model obtained by training is loaded in a typical small unmanned aerial vehicle prevention and control battle scene; a decision is made according to the state space s obtained in real time from the scene to obtain a real-time unmanned aerial vehicle prevention and control strategy, and the disposal equipment action a is applied in the scene to change the environment state and obtain real-time reward feedback.
Fig. 3 is a composition diagram of the reinforcement learning-based small unmanned aerial vehicle prevention and control command decision system of the present invention, which includes: the system comprises a multi-source data fusion module, a situation analysis module, a prevention and control planning module and an effect evaluation module.
The data fusion module is used for fusing data acquired by detecting the prevention and control environment and the target by the multi-type detection means; aiming at different types of prevention and control targets, information extraction, management, compilation and the like are carried out on prevention and control environment elements, prevention and control elements and disposal elements;
the situation analysis module is used for carrying out attribute analysis and judgment and threat assessment on the multi-source target data; performing attribute analysis and judgment on multi-source target data in the whole process of prevention and control judgment, and constructing a threat level model for threat assessment;
the control planning module is used for providing automatic treatment decision support for the unmanned aerial vehicle control specific tasks and resource planning activities; by adopting the small unmanned aerial vehicle prevention and control decision method based on reinforcement learning, the composition of a small unmanned aerial vehicle prevention and control system is clarified, and an internal model of the small unmanned aerial vehicle prevention and control system is constructed so as to extract combat situation information; designing a state space, an action space and a reward function, and constructing a small unmanned aerial vehicle prevention and control command decision model; training and optimizing a prevention and control command decision model to obtain a prevention and control disposal strategy, and verifying and evaluating the effect of the decision model;
the effect evaluation module is used for evaluating relevant disposal strategies and effects of unmanned aerial vehicle prevention and control and providing real-time feedback for unmanned aerial vehicle prevention and control command decision actions; and analyzing and processing the real-time prevention and control environment situation, the prevention and control target damage degree and the specific attack condition of the prevention and control treatment equipment.
An application method of a small unmanned aerial vehicle prevention and control command decision system based on reinforcement learning comprises the following steps:
s1: for different types of prevention and control targets, the data fusion module fuses the data acquired by the multi-type detection means from detection of the prevention and control environment and the targets, based on information extraction, management and compilation of the prevention and control environment elements, prevention and control elements and disposal elements;
s2: the situation analysis module, oriented to the whole prevention and control judgment process, performs attribute analysis and judgment on the multi-source target data and constructs a threat level model for threat assessment, so as to grasp the threat degree of the relevant targets, and uploads the threat judgment information to the prevention and control planning module;
s3: the control planning module adopts the small unmanned aerial vehicle control decision method based on reinforcement learning to make clear the composition of the small unmanned aerial vehicle control system and construct an internal model of the small unmanned aerial vehicle control system so as to extract the combat situation information; designing a state space, an action space and a reward function, and constructing a small unmanned aerial vehicle prevention and control command decision model; training and optimizing a prevention and control command decision model to obtain a prevention and control disposal strategy, and verifying and evaluating the effect of the decision model; the finally obtained small unmanned aerial vehicle prevention and control command decision model can be used for providing automatic disposal decision support for unmanned aerial vehicle prevention and control specific tasks and resource planning activities;
s4: the effect evaluation module analyzes and processes the real-time prevention and control environment situation, the damage degree of the prevention and control target and the specific striking situation of the prevention and control disposal equipment, is used for evaluating the relevant disposal strategies and effects of the prevention and control of the unmanned aerial vehicle, and provides real-time feedback for the decision-making action of the prevention and control command of the unmanned aerial vehicle.
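To show how the four modules cooperate in steps S1 to S4, a minimal, hypothetical Python wiring is sketched below; every class and method name in it is an assumption used only to illustrate the data flow, not an interface defined by the present application.

class CommandControlPipeline:
    """Hypothetical wiring of the four command-control modules (steps S1-S4)."""
    def __init__(self, fusion, situation, planner, evaluator):
        self.fusion = fusion        # multi-source data fusion module
        self.situation = situation  # situation analysis / threat assessment module
        self.planner = planner      # reinforcement-learning prevention and control planning module
        self.evaluator = evaluator  # effect evaluation module

    def step(self, sensor_reports):
        track = self.fusion.fuse(sensor_reports)           # S1: fuse multi-sensor detections
        threat = self.situation.assess(track)              # S2: attribute judgment and threat level
        action = self.planner.decide(track, threat)        # S3: DQN-based disposal decision
        feedback = self.evaluator.evaluate(track, action)  # S4: real-time effect feedback
        return action, feedback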
The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (5)

1. A small unmanned aerial vehicle prevention and control command decision method based on reinforcement learning is characterized by comprising the following steps:
s1, determining the composition of a small unmanned aerial vehicle prevention and control system, wherein the small unmanned aerial vehicle prevention and control system comprises a detection subsystem, a disposal subsystem and a command control system; the detection subsystem is used for providing combat situation information, the disposal subsystem is responsible for implementing prevention and control disposal, and the command control system is used for receiving the combat situation information from the detection subsystem and scheduling a plurality of disposal means to generate a disposal strategy; the detection subsystem comprises single-type or multi-type detection equipment, and the disposal subsystem comprises multiple types of soft-kill disposal equipment and hard-interception disposal equipment; the command control system comprises a multi-source data fusion module, a situation analysis module, a prevention and control planning module and an effect evaluation module;
s2, establishing a three-degree-of-freedom particle motion model of the small unmanned aerial vehicle;
s3, constructing a small unmanned aerial vehicle prevention and control command decision model;
s4, training and optimizing a small unmanned aerial vehicle prevention and control command decision model;
s5, verifying and evaluating the prevention and control effect of the small unmanned aerial vehicle prevention and control command decision model;
the step S3 specifically comprises: the disposal equipment of the unmanned aerial vehicle prevention and control system comprises laser interception equipment and radio interference equipment, wherein the actions of the laser equipment comprise four actions of turning on the laser equipment, turning off the laser equipment, keeping the equipment state and adjusting the laser pointing direction, and the actions of the radio interference equipment comprise four actions of turning on interference, turning off interference, keeping the action and adjusting the interference pointing direction; the various actions of the disposal equipment are encoded with three binary bits, of which the first bit represents the type of equipment and the last two bits represent the corresponding specific action of the equipment, so that the action taken by the disposal equipment of the prevention and control system is represented by a triplet formed from the three binary bits;
according to the characteristics of the small unmanned aerial vehicle prevention and control task and the Markov decision process, a small unmanned aerial vehicle prevention and control command decision model is established, a state space and a disposal decision space are designed, and a reward function is determined according to the prevention and control intention of a small unmanned aerial vehicle prevention and control system;
the small unmanned aerial vehicle control command decision model is established by adopting a reinforcement learning algorithm, interaction between the intelligent decision model and the environment is described by adopting a Markov decision process in reinforcement learning, and the Markov decision process is realized by utilizing a state space, an action space, a reward function and a discount coefficient;
the expression of the state space S of the unmanned aerial vehicle prevention and control command decision model is as follows:
S = [d_t, v_t, θ_t, ψ_t, t_l, t_j],
wherein the distance d_t and the flight speed v_t are computed from the position coordinates of the small unmanned aerial vehicle and of the detection device:
d_t = √[(x_t − x_a)² + (y_t − y_a)² + (z_t − z_a)²],
v_t = √[(x_t − x_{t−Δt})² + (y_t − y_{t−Δt})² + (z_t − z_{t−Δt})²] / Δt,
wherein (x_t, y_t, z_t) and (x_{t−Δt}, y_{t−Δt}, z_{t−Δt}) respectively denote the position coordinates of the small unmanned aerial vehicle at time t and at time t−Δt, (x_a, y_a, z_a) denotes the position coordinates of the detection device, and Δt denotes the stepping time interval of the Markov decision process; d_t denotes the distance between the small unmanned aerial vehicle and the detection equipment at time t; v_t denotes the flight speed of the small unmanned aerial vehicle at time t; t_l denotes the light emitting time of the laser interception equipment; t_j denotes the time for which the radio interference device is switched on; θ and ψ respectively denote the pitch angle and the yaw angle of the unmanned aerial vehicle;
the expression of the action space A of the unmanned aerial vehicle prevention and control command decision model is A = [D_t, D_a1, D_a2], wherein the device type D_t takes the value 0 or 1, and the action type of the equipment is determined by the combination of the action variables D_a1 and D_a2; the specific values of the action variables [D_a1, D_a2] include the four combinations 00, 01, 10 and 11;
when the prevention and control intention of the small unmanned aerial vehicle prevention and control system is to prevent and control targets at medium and long range, the defense success condition is expressed through reward functions defined over the flight components of the small unmanned aerial vehicle, namely an angle reward function R_a, a distance reward function R_d and a speed reward function R_v, whose piecewise expressions are given as equation images in the original publication;
wherein q denotes the included angle between the speed vector of the small unmanned aerial vehicle and the line connecting the small unmanned aerial vehicle with the detection equipment, and q_m denotes the angle at which the angle reward takes its minimum positive reward value; the angle reward function R_a takes two different reward values depending on whether the unmanned aerial vehicle lies within or outside the line-of-sight angle range of the detection equipment, is minimum when the angle q is 0 and maximum when the angle q is π; the distance reward function R_d is expressed by a linear function of the distance, wherein k is a smoothing coefficient that keeps the distance reward function at the minimum positive reward value, and d_f and d_c respectively denote the maximum radius of the prevention and control area of the small unmanned aerial vehicle and the minimum detection distance of the detection equipment; the speed reward function R_v uses two reward coefficients corresponding to the flight speed of the small unmanned aerial vehicle being below a certain flight speed threshold or above the maximum flight speed threshold, wherein v_min, v_max and v_xh respectively denote the minimum flight speed, the maximum flight speed and the cruising flight speed of the small unmanned aerial vehicle;
r is to bea,RdAnd RvAnd performing weighted summation to obtain an expression of a reward function R of the small unmanned aerial vehicle prevention and control command decision model, wherein the expression specifically comprises the following steps:
R=a1·Ra+a2·Rd+a3·Rv
wherein, a1,a2,a3The weights corresponding to the angle reward function, the distance reward function and the speed reward function can be obtained according to empirical values, and satisfy constraint conditions: a is1+a2+a3=1,a1,a2,a3≥0。
2. The reinforcement learning-based unmanned aerial vehicle control and command decision method according to claim 1,
the detection subsystem comprises radar detection equipment, photoelectric detection equipment and radio detection equipment, and the treatment subsystem comprises radio interference equipment and laser interception equipment.
3. The reinforcement learning-based unmanned aerial vehicle control and command decision method according to claim 1,
the step S2 specifically includes: taking the small unmanned aerial vehicle as particles, and establishing a three-degree-of-freedom particle motion model:
dx/dt = v·cosθ·cosψ, dy/dt = v·cosθ·sinψ, dz/dt = v·sinθ,
wherein (x, y, z) represents the coordinates of the small unmanned aerial vehicle in a three-dimensional space coordinate system of the ground, v, theta and psi respectively represent the flight speed, the pitch angle and the yaw angle of the small unmanned aerial vehicle, and t represents time.
4. The reinforcement learning-based unmanned aerial vehicle control and command decision method according to claim 1,
the step S4 specifically comprises: training the small unmanned aerial vehicle prevention and control command decision model by using a deep Q-network (DQN) algorithm until the model can generate prevention and control disposal strategies for driving away, damaging and striking small unmanned aerial vehicles executing different tasks; when the defense success rate of the strategy exceeds a certain threshold, training is stopped and the neural network model parameters at that moment are saved, thereby completing the training and optimization of the small unmanned aerial vehicle prevention and control command decision model;
in the DQN algorithm, a value evaluation network and a value target network are constructed; the output of the value evaluation network is denoted Q(s, a | θ), its inputs are the handling action variable a taken at the previous moment and the state variable s at the current moment, its output determines the handling action variable to be taken at the next moment, and its parameters are θ; the value evaluation network updates and optimizes its parameters θ by minimizing the difference between its own state-action value and the state-action value given by the value target network, the Q(s, a | θ) value being produced directly by the network; the output of the value target network is denoted Q'(s, a | θ⁻), its inputs are the handling action variable a taken at the previous moment and the state variable s at the current moment, and its parameters are θ⁻; the training target is formed from the value target network output Q' and the reward r_j, with the specific expressions:
y_j = r_j + γ·max_{a_{j+1}} Q'(s_{j+1}, a_{j+1} | θ⁻),
L(θ) = E_j[(y_j − Q(s_j, a_j | θ))²],
wherein the index j denotes the j-th data item in the dataset sampled from the experience pool, r_j denotes the reward corresponding to the j-th data item, s_j and a_j denote the state variable and the handling action variable of the j-th data item, and s_{j+1} and a_{j+1} denote the state variable and the handling action variable corresponding to the (j+1)-th data item in the sampled dataset; Q'(s_{j+1}, a_{j+1} | θ⁻) denotes the value target network output corresponding to the j-th data item; γ is the reward discount factor; L(θ) denotes the loss function used in training the value evaluation network with parameter θ; max_{a_{j+1}} Q'(s_{j+1}, a_{j+1} | θ⁻) denotes the maximum value output by the value target network after action a_{j+1} is taken in state s_{j+1}; and L(θ) is the least-squares error between the target value formed from the value target network and the value predicted by the value evaluation network;
for the value evaluation network, the parameter θ is updated in the direction that increases the value output by the value evaluation network, a process expressed as:
θ ← θ + α·(y_j − Q(s_j, a_j | θ))·∇_θ Q(s_j, a_j | θ),
wherein ∇_θ Q(s_j, a_j | θ) denotes the gradient of the Q-value function with respect to the parameter θ for the state variable s_j and action variable a_j, α is the learning rate, and following this direction is equivalent to descending the gradient ∇_θ L(θ) of the loss function L(θ) with respect to the parameter θ; by temporarily freezing the value target network parameters, the parameters of the value target network are updated only after the value evaluation network has been trained for a certain number of periods, at which point the value evaluation network parameters θ are copied to the value target network parameters θ⁻, thereby maintaining the stage-wise stationarity of the value target network;
the value target network and the value evaluation network both adopt a neural network architecture composed of fully connected layers; 3 fully connected layers are used, with 200, 100 and 50 neurons respectively.
5. The reinforcement learning-based unmanned aerial vehicle control and command decision method according to claim 1, wherein the step S5 specifically comprises: loading the small unmanned aerial vehicle prevention and control command decision model obtained by the training in step S4 in an actual small unmanned aerial vehicle prevention and control scene, making a decision according to the state space obtained in real time from that scene to obtain a handling action variable a, and applying the handling action variable a to the actual scene, thereby obtaining the small unmanned aerial vehicle prevention and control strategy in real time, changing the environment state and obtaining real-time reward feedback.
CN202110602580.1A 2021-05-31 2021-05-31 Small unmanned aerial vehicle prevention and control command decision method and system based on reinforcement learning Active CN113268081B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110602580.1A CN113268081B (en) 2021-05-31 2021-05-31 Small unmanned aerial vehicle prevention and control command decision method and system based on reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110602580.1A CN113268081B (en) 2021-05-31 2021-05-31 Small unmanned aerial vehicle prevention and control command decision method and system based on reinforcement learning

Publications (2)

Publication Number Publication Date
CN113268081A CN113268081A (en) 2021-08-17
CN113268081B true CN113268081B (en) 2021-11-09

Family

ID=77233727

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110602580.1A Active CN113268081B (en) 2021-05-31 2021-05-31 Small unmanned aerial vehicle prevention and control command decision method and system based on reinforcement learning

Country Status (1)

Country Link
CN (1) CN113268081B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114239392B (en) * 2021-12-09 2023-03-24 南通大学 Unmanned aerial vehicle decision model training method, using method, equipment and medium
CN114963879B (en) * 2022-05-20 2023-11-17 中国电子科技集团公司电子科学研究院 Comprehensive control system and method for unmanned aerial vehicle
CN115017759B (en) * 2022-05-25 2023-04-07 中国航空工业集团公司沈阳飞机设计研究所 Terminal autonomic defense simulation verification platform of unmanned aerial vehicle
JP7407329B1 (en) * 2023-10-04 2023-12-28 株式会社インターネットイニシアティブ Flight guidance device and flight guidance method
CN117527135B (en) * 2024-01-04 2024-03-22 北京领云时代科技有限公司 System and method for interfering unmanned aerial vehicle communication based on deep learning


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190220737A1 (en) * 2018-01-17 2019-07-18 Hengshuai Yao Method of generating training data for training a neural network, method of training a neural network and using neural network for autonomous operations

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2007080584A2 (en) * 2006-01-11 2007-07-19 Carmel-Haifa University Economic Corp. Ltd. Uav decision and control system
CN109445456A (en) * 2018-10-15 2019-03-08 清华大学 A kind of multiple no-manned plane cluster air navigation aid
CN111026147A (en) * 2019-12-25 2020-04-17 北京航空航天大学 Zero overshoot unmanned aerial vehicle position control method and device based on deep reinforcement learning
CN112215283A (en) * 2020-10-12 2021-01-12 中国人民解放军海军航空大学 Close-range air combat intelligent decision method based on manned/unmanned aerial vehicle system
CN112797846A (en) * 2020-12-22 2021-05-14 中国船舶重工集团公司第七0九研究所 Unmanned aerial vehicle prevention and control method and system

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
A Neural Network-based Intelligent Decision-Making in the Air-Offensive Campaign with Simulation; Gang Hu; 16th International Conference on Computational Intelligence and Security; 20201130; full text *
Deep Q-Network Learning Based on Action Space Noise; 吴夏铭; Journal of Changchun University of Science and Technology (Natural Science Edition); 20200831; full text *
Research on UAV Air Combat Maneuver Decision-Making Based on a Reinforced Genetic Algorithm; 谢建峰; Journal of Northwestern Polytechnical University; 20201231; full text *

Also Published As

Publication number Publication date
CN113268081A (en) 2021-08-17


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant