CN114679729A - Radar communication integrated unmanned aerial vehicle cooperative multi-target detection method - Google Patents
- Publication number: CN114679729A (application CN202210336444.7A)
- Authority: CN (China)
- Prior art keywords: unmanned aerial vehicle, detection, radar
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W52/00—Power management, e.g. TPC [Transmission Power Control], power saving or power classes
- H04W52/04—TPC
- H04W52/30—TPC using constraints in the total amount of available transmission power
- H04W52/34—TPC management, i.e. sharing limited amount of power among users or channels or data types, e.g. cell loading
- H04W52/346—TPC management, i.e. sharing limited amount of power among users or channels or data types, e.g. cell loading distributing total power among users or channels
-
- G—PHYSICS
- G01—MEASURING; TESTING
- G01S—RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
- G01S13/00—Systems using the reflection or reradiation of radio waves, e.g. radar systems; Analogous systems using reflection or reradiation of waves whose nature or wavelength is irrelevant or unspecified
- G01S13/02—Systems using reflection of radio waves, e.g. primary radar systems; Analogous systems
- G01S13/50—Systems of measurement based on relative movement of target
- G01S13/52—Discriminating between fixed and moving objects or between objects moving at different speeds
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W16/00—Network planning, e.g. coverage or traffic planning tools; Network deployment, e.g. resource partitioning or cells structures
- H04W16/02—Resource partitioning among network components, e.g. reuse partitioning
- H04W16/10—Dynamic resource partitioning
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W16/00—Network planning, e.g. coverage or traffic planning tools; Network deployment, e.g. resource partitioning or cells structures
- H04W16/22—Traffic simulation tools or models
Abstract
The invention provides a radar-communication integrated unmanned aerial vehicle (UAV) cooperative multi-target detection method, in which a plurality of UAVs carry radar-communication integrated equipment for cooperative detection. Each UAV is set as an agent, a stable detection policy is trained, and the trained policy is used to control the flight trajectories of the UAVs and the resource allocation between radar and communication, so that a given detection task is completed quickly. The method takes the radar, communication, and flight states observed by each agent as the input of a policy generation module, uses a deep neural network to map each agent's observed states and actions into a stochastic policy, uses a policy evaluation module to evaluate each agent's policy, and obtains a better cooperative policy through module training. By efficiently planning resources such as radar and communication across multiple UAVs, the invention realizes the search of multiple targets in a designated area and greatly improves the efficiency of searching for and discovering multiple targets.
Description
Technical Field
The invention belongs to the field of radar communication integration and cluster cooperative detection, and particularly relates to a radar communication integration unmanned aerial vehicle cooperative multi-target detection method.
Background
Existing work on simultaneous detection considers resource allocation only in a static environment and does not address the trajectory design of the unmanned aerial vehicle, yet trajectory design is essential for exploiting UAV maneuverability and flexibility. For example, prior work has designed a static radar-communication integrated UAV network utility optimization method based on power control, and a static radar-communication integrated resource allocation method for UAV clusters under reinforcement learning. Moreover, when allocating radar-communication resources in a dynamic environment, a UAV often faces time-varying channels and limited observation information, problems that traditional optimization methods struggle to solve; for instance, prior work has used game theory to distribute the power of radar-communication integrated UAVs.
Disclosure of Invention
The purpose of the invention is as follows: the invention aims to solve the technical problems of the prior art, and provides a radar-communication integrated unmanned aerial vehicle cooperative multi-target detection method, which comprises the following steps:
step 1, modeling an unmanned aerial vehicle cooperative multi-target detection problem;
and 2, designing a multi-agent cooperative detection scheme.
The step 1 comprises the following steps:
step 1-1, defining a problem;
step 1-2, designing flight path constraints of the unmanned aerial vehicle;
step 1-3, designing resource allocation under the integration of radar communication of the unmanned aerial vehicle;
step 1-4, measuring the performance of radar and communication of the unmanned aerial vehicle;
step 1-5, carrying out multi-unmanned aerial vehicle cooperative detection reinforcement learning modeling;
and 1-6, designing a strategy learning module and a strategy evaluation module.
The step 1-1 comprises the following steps: each unmanned aerial vehicle is set as an agent, and all agents cooperate to complete the detection task of the area. Each unmanned aerial vehicle sends the information obtained by detection to the control center in real time through a communication link, and the total detection time is T. Within the detection time, by allocating radar and communication resources and the trajectories of the unmanned aerial vehicles in the given area, the goal is to maximize the data rate between the unmanned aerial vehicles and the control center as well as the detection performance, where the detection performance is expressed by the detection fairness over all targets.
The step 1-2 comprises the following steps: the whole detection time is divided into S time slots, each of duration tau. Each agent finishes its detection and communication tasks in a short period at the beginning of each time slot, and the remaining time is used for flying. The time for communication and detection is determined by the channel bandwidth allocated to them: assuming the allocated channel bandwidth is x Hz, the execution time is 1/x, which is typically much less than tau.
In each flight interval, each drone can fly in a direction theta_m(t) ∈ [0, 2π) for a distance l_m(t) ∈ [0, l_Max], where l_Max represents the maximum distance a drone can fly during time tau; this distance is determined by the model of the drone. For an agent departing from coordinate [x_m(0), y_m(0)], the movement within time t is represented as:

x_m(t) = x_m(0) + Σ_{t'=1}^{t} l_m(t') cos(theta_m(t'))
y_m(t) = y_m(0) + Σ_{t'=1}^{t} l_m(t') sin(theta_m(t'))

where l_m(t') represents the actual moving distance of the mth drone in the t'-th time slot, and theta_m(t') represents the flight direction of the mth drone in the t'-th time slot;
set that the unmanned aerial vehicle can only be in [ X ]Min,XMax]×[YMin,YMax]Thus, there are:
XMin≤xm(t)≤XMax
YMin≤ym(t)≤YMax
wherein, XMin,XMax,YMin,YMaxRespectively representing the movement minimum value of the unmanned aerial vehicle movement coordinate on an x axis, the movement maximum value on the x axis, the movement minimum value on a y axis and the movement maximum value on the y axis; the three-dimensional rectangular coordinate system with the origin of 0 is used, the X-y axis represents the ground, and the minimum value and the maximum value of the unmanned aerial vehicle capable of flying in the X-axis direction are XMin,XMaxIn the Y-axis direction, the minimum value and the maximum value that each unmanned aerial vehicle can fly are YMin,YMax. The positive half axis of the z-axis represents the flight height of the drone.
A safe distance between the drones is imposed, expressed as:

d_mm'(t) ≥ D_S

where d_mm'(t) represents the distance from the mth drone to the m'-th drone in the tth time slot, and D_S represents the safe distance between any two drones.
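The per-slot motion model and the two constraints above can be sketched as follows; the area bounds, safe distance, and coordinates are illustrative placeholders, not values from the patent.

```python
import math

# Illustrative bounds and safe distance (placeholders, not patent values).
X_MIN, X_MAX, Y_MIN, Y_MAX = 0.0, 100.0, 0.0, 100.0
D_S = 5.0  # safe distance between any two drones

def move(x, y, l, theta):
    """One slot of the motion model: fly distance l in direction theta,
    then clamp the position into [X_MIN, X_MAX] x [Y_MIN, Y_MAX]."""
    nx = min(max(x + l * math.cos(theta), X_MIN), X_MAX)
    ny = min(max(y + l * math.sin(theta), Y_MIN), Y_MAX)
    return nx, ny

def safe(positions):
    """True iff every pair of drones respects the safe distance D_S."""
    return all(
        math.dist(p, q) >= D_S
        for i, p in enumerate(positions)
        for q in positions[i + 1:]
    )
```

For example, `move(50.0, 50.0, 10.0, 0.0)` moves a drone 10 units along the x axis, and `safe` flags any pair closer than `D_S`.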
The steps 1-3 comprise: the resources allocated for each drone's radar and communication processes are transmit power and channel:

For a given total transmit power P, a power-division factor beta_m(t) allocates the respective power to the detection and communication functions: beta_m(t)P represents the communication power allocated to the mth drone at time t, and (1 − beta_m(t))P represents the radar transmit power allocated to the mth drone at time t;

For a total of K channels, rho_mk(t) denotes the selection of the kth channel at time t: rho_mk(t) = 1 means the mth agent selects the kth channel, and rho_mk(t) = 0 means the mth agent does not select the kth channel.
The steps 1 to 4 comprise:

According to the radar power (1 − beta_m(t))P allocated to the mth drone at time t, the detection range of each agent is estimated using the following radar equation:

phi_m(t) = [ (1 − beta_m(t))P · G_Tx · G_Rx · lambda² · sigma / ((4π)³ · Γ · T_0 · B · F · gamma · Phi_Min) ]^(1/4)

where B represents the drone communication channel bandwidth; phi_m(t) represents the farthest distance the mth drone can probe in the tth time slot; G_Tx and G_Rx respectively represent the transmit and receive antenna gains; lambda represents the wavelength of the transmitted signal; sigma represents the effective detection area; Γ represents the Boltzmann constant; T_0 represents the thermodynamic temperature; F and gamma represent the radar noise and detection loss, respectively; and Phi_Min represents the minimum signal-to-noise ratio for drone detection;

The condition for the mth agent to detect the nth target is defined as: phi_m(t) ≥ d_mn(t), where d_mn(t) represents the distance between the mth agent and the nth target at time t;
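As a rough illustration, the range estimate of steps 1-4 follows the classical fourth-root radar range equation; the gains, wavelength, bandwidth, and noise figures below are assumed placeholder values, not the patent's.

```python
import math

def detection_range(p_radar, g_tx=1e3, g_rx=1e3, lam=0.03, sigma=1.0,
                    boltzmann=1.38e-23, t0=290.0, bandwidth=1e6,
                    noise_f=2.0, loss=1.5, snr_min=10.0):
    """phi = (P*Gtx*Grx*lambda^2*sigma /
              ((4*pi)^3 * Gamma * T0 * B * F * gamma * Phi_min))^(1/4).
    All defaults are illustrative placeholders."""
    num = p_radar * g_tx * g_rx * lam ** 2 * sigma
    den = ((4 * math.pi) ** 3 * boltzmann * t0 * bandwidth
           * noise_f * loss * snr_min)
    return (num / den) ** 0.25

def detects(phi_m, d_mn):
    """The mth drone detects target n iff phi_m(t) >= d_mn(t)."""
    return phi_m >= d_mn
```

Note the fourth-root behavior: multiplying the radar power by 16 only doubles the detection range, which is why the power-division factor matters.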
A detection score epsilon_n(t) is defined for each target, where c_n(t) represents the number of times the nth target has been detected by time t;
The fairness g(t) with which the targets are detected is defined over the detection scores, where N represents the total number of detected targets.
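The exact expressions for epsilon_n(t) and g(t) are not reproduced in the text above; a common choice for such a per-target fairness metric is Jain's index over the detection counts, sketched here under that assumption.

```python
def detect_score(counts, n):
    """Assumed score for target n: its share of all detections, c_n / sum(c)."""
    total = sum(counts)
    return counts[n] / total if total else 0.0

def fairness(counts):
    """Jain's fairness index over per-target detection counts:
    g = (sum c_n)^2 / (N * sum c_n^2); 1.0 means perfectly even coverage,
    and it approaches 1/N when a single target absorbs all detections."""
    if not any(counts):
        return 0.0
    n = len(counts)
    return sum(counts) ** 2 / (n * sum(c * c for c in counts))
```

Maximizing such an index pushes the swarm to spread detections over all N targets rather than repeatedly re-detecting nearby ones.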
The steps 1 to 5 comprise: a 5-tuple (O, S, A, R, P) is used to describe the decision process, where O refers to the observation space of each agent, S refers to the joint state space of all agents, A refers to the action space of the agents, R refers to the reward function of the agents, and P refers to the transition probability of each agent;
Observation space O: the observation of the mth agent consists of its current coordinate (x_m(t), y_m(t)), the distance l_m(t−1) moved at the previous time, the direction theta_m(t−1) at the previous time, the channel rho_m(t−1) allocated to the drone's communication function at the previous time, the communication/radar power-division factor beta_m(t−1) at the previous time, and the communication data rate R_m(t−1) obtained at the previous time, written as a whole as o_m(t) = {x_m(t), y_m(t), l_m(t−1), theta_m(t−1), rho_m(t−1), beta_m(t−1), R_m(t−1)};
Action space A: the action space is defined as the mth agent's moving direction theta_m(t) at the current time, the distance l_m(t) it can move in that direction, the communication-channel allocation factor rho_m(t), and the power-division factor beta_m(t), generally expressed as a_m(t) = {theta_m(t), l_m(t), rho_m(t), beta_m(t)};
Reward function R: it defines the detection reward and the penalties for erroneous behavior of all agents. The reward of the mth agent combines the communication data rate R_m(t) measured at time t and the detection fairness, minus three penalty terms: the penalty incurred when the mth drone crosses the boundary, the penalty incurred when drones collide with each other, and the penalty incurred when the radar cannot cover the ground;
Steps 1-6 include: configuring a strategy learning module and a strategy evaluation module for each unmanned aerial vehicle, wherein the strategy learning module is used for generating strategies, and the strategy evaluation module is used for evaluating the generated strategies;
The policy learning module comprises an online policy network pi_theta_m(o, a) of the mth drone, a historical policy network pi_theta_m^old(o, a), an optimizer, and a loss function; o and a represent the set of states and actions of the drone, respectively;
the online strategy network is used for generating a random strategy, mapping the collected state and corresponding action of each agent into strategy distribution through a neural network, and adopting a Gaussian model as the strategy distribution;
The historical policy network is used to reuse the historical experience collected by each agent so as to improve its sampling efficiency. The loss function of each agent is set to its expected return J(theta_m), expressed as

J(theta_m) = E[ min( x(theta_m) · A_m(t), f_CL(x(theta_m)) · A_m(t) ) ]

where theta_m represents the parameters of the policy network in the mth agent, E[·] represents the expectation, x(theta_m) = pi_theta_m(a|o) / pi_theta_m^old(a|o) represents the probability ratio between the current policy and the historical policy, and A_m(t) is the advantage produced by the policy evaluation module. The function f_CL restricts x(theta_m) to [1 − epsilon, 1 + epsilon], expressed as f_CL(x) = clip(x, 1 − epsilon, 1 + epsilon), where epsilon represents a clipping parameter;
The policy evaluation module evaluates the policy obtained by each agent by generating an advantage function, expressed as

A_m(t) = r_m(t) + gamma · V_omega_m(s(t+1)) − V_omega_m(s(t))

where V_omega_m represents the value function of the evaluation network in the mth agent, omega_m represents the parameters of the corresponding evaluation network, gamma represents the discount factor, and r_m(t) represents the reward obtained by the mth drone at time t;
The exploratory behavior of an agent in the environment is enhanced by introducing a policy entropy term, represented as f_E(theta_m) = E[ H(pi_theta_m) ], where H(·) represents the entropy function of the online policy pi.
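The clipped surrogate objective and entropy bonus described for the policy-learning module mirror the standard PPO form; the sketch below assumes that form, with an illustrative clipping parameter, entropy coefficient, and a one-dimensional Gaussian entropy.

```python
import math

def clipped_objective(ratio, advantage, eps=0.2):
    """Per-sample J(theta) = min(x*A, clip(x, 1-eps, 1+eps)*A),
    where x is the probability ratio pi/pi_old."""
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)

def gaussian_entropy(sigma):
    """Entropy of a 1-D Gaussian policy: 0.5 * log(2*pi*e*sigma^2)."""
    return 0.5 * math.log(2.0 * math.pi * math.e * sigma ** 2)

def actor_loss(ratio, advantage, sigma, ent_coef=0.01):
    """L_A(theta) = J(theta) + f_E(theta): surrogate plus entropy bonus
    (a quantity to be maximized)."""
    return clipped_objective(ratio, advantage) + ent_coef * gaussian_entropy(sigma)
```

The clip keeps a single large probability ratio from dominating an update, while the entropy term keeps sigma from collapsing too early and ending exploration.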
The step 2 comprises the following steps:
Step 2-1, initializing model parameters: the parameters of the different modules are initialized, including the parameters theta_m of the online policy network, the parameters of the historical policy network, the parameters omega_m of the evaluation network, the learning rate beta_A of the policy network, the learning rate beta_I of the evaluation network, and the discount factor gamma;
step 2-2, collecting samples:
After observing the environment, each drone obtains an observation vector, including its coordinate at the current time and its movement information at the previous time, expressed as o_m(t) = {x_m(t), y_m(t), l_m(t−1), theta_m(t−1), rho_m(t−1), beta_m(t−1), R_m(t−1)}.
Step 2-3, inputting the observation vector into a deep neural network to obtain online strategy distribution, and then sampling from the online strategy distribution to obtain a corresponding action vector:
A Gaussian model is adopted as the policy distribution; for the mth drone, its online policy distribution pi_theta_m(o, a) is represented by:

pi_theta_m(o_m, a_m) = (1 / (sqrt(2π) · sigma(o_m))) · exp( −(a_m − mu(o_m))² / (2 · sigma(o_m)²) )

where o_m and a_m respectively represent the state observed and the action performed by the mth agent, and mu and sigma represent the mean and standard-deviation functions, respectively;
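A minimal sketch of the Gaussian policy head of steps 2-3, with a toy linear mapping standing in for the deep neural network; the weights and the log-probability helper are illustrative assumptions.

```python
import math
import random

def policy_head(obs, w_mu=0.1, w_sigma=0.05):
    """Toy stand-in for the policy network: maps an observation vector to
    the mean and standard deviation of the Gaussian action distribution."""
    mu = w_mu * sum(obs)
    sigma = math.exp(w_sigma * sum(obs)) * 0.1 + 1e-6  # keep sigma > 0
    return mu, sigma

def sample_action(obs, rng=random):
    """Draw an action a ~ N(mu(o), sigma(o)^2)."""
    mu, sigma = policy_head(obs)
    return rng.gauss(mu, sigma)

def log_prob(a, mu, sigma):
    """log N(a; mu, sigma^2) - useful later for the ratio pi/pi_old."""
    return (-0.5 * ((a - mu) / sigma) ** 2
            - math.log(sigma * math.sqrt(2.0 * math.pi)))
```

In practice mu and sigma come from the deep network of the policy-learning module; only the sampling and log-density structure is shown here.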
Step 2-4, sampling and executing actions:

Power beta_m(t)P is allocated to the communication process of each drone, (1 − beta_m(t))P is allocated as radar transmit power to the radar process, and the ⌈rho_m(t)K⌉-th channel is selected, where ⌈·⌉ represents the ceiling function;

Each drone is controlled to fly a distance l_m(t) in direction theta_m(t);
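Decoding a sampled action into resources as in step 2-4 — the power split beta_m(t)P / (1 − beta_m(t))P and the channel index ⌈rho_m(t)K⌉ — can be sketched as follows; the total power and channel count are placeholders.

```python
import math

def decode_action(beta, rho, p_total=10.0, k_channels=8):
    """Turn the continuous action components (beta, rho) into resources:
    communication power P*beta, radar power P*(1-beta), and a 1-indexed
    channel ceil(rho*K).  p_total and k_channels are illustrative."""
    p_comm = beta * p_total
    p_radar = (1.0 - beta) * p_total
    channel = max(1, math.ceil(rho * k_channels))  # guard rho = 0
    return p_comm, p_radar, channel
```

The `max(1, ...)` guard keeps a sampled rho of exactly 0 from producing channel index 0, which the 1-indexed scheme in the text does not define.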
Step 2-5, detecting penalized actions:

Three penalized behaviors are defined for each drone: crossing the boundary, colliding with another drone, and failing to cover the ground;

The penalty obtained when the mth drone crosses the boundary equals xi_1 if x_m(t) ∉ [X_Min, X_Max] or y_m(t) ∉ [Y_Min, Y_Max], and 0 otherwise, where xi_1 represents a penalty value;

The penalty obtained when the mth drone and the m'-th drone collide equals xi_2 if d_mm'(t) < D_S, and 0 otherwise, where xi_2 represents a penalty value, d_mm'(t) represents the distance between the mth drone and the m'-th drone, and D_S is the safe distance defined between any two drones;

The penalty obtained when the radar cannot cover the ground equals xi_3 if the detection range phi_m(t) is less than H, and 0 otherwise, where xi_3 represents a penalty value and H represents the farthest distance that must be detectable;
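The three penalty checks and the reward of steps 2-5 can be sketched as follows; all thresholds and xi values are illustrative, and the additive reward form is an assumption consistent with the description above.

```python
# Illustrative penalty magnitudes (placeholders, not patent values).
XI1, XI2, XI3 = 1.0, 1.0, 1.0

def boundary_penalty(x, y, x_min=0.0, x_max=100.0, y_min=0.0, y_max=100.0):
    """xi1 when the drone leaves [x_min, x_max] x [y_min, y_max], else 0."""
    inside = x_min <= x <= x_max and y_min <= y <= y_max
    return 0.0 if inside else XI1

def collision_penalty(d_mm, d_safe=5.0):
    """xi2 when two drones get closer than the safe distance D_S."""
    return XI2 if d_mm < d_safe else 0.0

def coverage_penalty(phi, h=50.0):
    """xi3 when the radar range phi cannot reach the coverage distance H."""
    return XI3 if phi < h else 0.0

def total_reward(data_rate, fairness_g, penalties):
    """Assumed additive reward: R_m(t) + g(t) minus incurred penalties."""
    return data_rate + fairness_g - sum(penalties)
```

Each check mirrors one of the three penalized behaviors, so the state rollback in step 2-5 can be triggered whenever any of them returns a nonzero value.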
The final reward obtained by each drone is calculated by accounting for the penalties it has incurred;

After the action of the current time slot is finished, each drone observes the state at the start of the next time slot;

Whether the mth drone has exhibited any of the three penalized behaviors is checked; if so, it rolls back to the current state at the next time;
Step 2-6, generating the joint state information:
Each drone sends its state information to the information fusion center; the information fusion center integrates all the observation information into the joint state and sends the state information of the current time back to each drone, where M represents the set of drones;

Each drone repeats steps 2-2 to 2-6 continuously until the jth batch is obtained, comprising in total N_B pieces of observation information B_o,j, state information B_s,j, and action information B_a,j; the rewards of the jth batch are expressed as B_r,j.
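The per-batch sample collection of steps 2-2 to 2-6 can be sketched as a simple buffer that accumulates N_B (observation, state, action, reward) tuples before an update; the field names are illustrative.

```python
class Batch:
    """Minimal rollout buffer for one drone: collects N_B samples of
    (observation, joint state, action, reward) before a network update."""

    def __init__(self, n_b):
        self.n_b = n_b
        self.obs, self.states, self.acts, self.rews = [], [], [], []

    def add(self, o, s, a, r):
        self.obs.append(o)
        self.states.append(s)
        self.acts.append(a)
        self.rews.append(r)

    def full(self):
        """True once the batch holds N_B samples and is ready for step 2-7."""
        return len(self.rews) >= self.n_b
```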
And 2-7, updating the network parameters.
The policy network parameters are updated as theta_m ← theta_m + beta_A · ∇L_A(theta_m), where L_A(theta_m) = J(theta_m) + f_E(theta_m) represents the loss function of the policy network and ∇ represents the gradient;

The parameters in the online policy network are copied directly to the historical policy network, where pi_theta represents the policy obtained from the online network and pi_theta^old represents the historical policy of the agent;

Using B_s,j and B_r,j, the parameters omega_m of the evaluation network are updated, where beta_I represents the learning rate of the evaluation network, A_I(omega_m) represents the loss function of the evaluation network, and ∇_omega_m represents the gradient with respect to omega_m;
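The evaluation-network side of step 2-7 can be sketched under the assumption of a one-step TD advantage and a squared-error value update, with a single scalar standing in for the network.

```python
def td_advantage(r, v_s, v_next, gamma=0.99):
    """One-step TD advantage: A_m(t) = r + gamma * V(s') - V(s)."""
    return r + gamma * v_next - v_s

def update_value(v, target, lr=0.1):
    """One gradient step of the evaluation network on (V - target)^2,
    for a scalar stand-in V: V <- V - lr * 2 * (V - target)."""
    return v - lr * 2.0 * (v - target)
```

In the full method V is a parameterized network updated with learning rate beta_I over the whole batch; the scalar form only exposes the update direction.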
Steps 2-1 to 2-7 are repeated; if all targets are detected or a training round is finished, a new round of training is performed, until all drones finish all rounds of training.
Aiming at the problems of existing UAV-cluster cooperative target detection methods, the method provided by the invention has the following advantages. First, radar and communication are integrated: the communication and detection functions share the radar spectrum, which alleviates the shortage of communication spectrum resources while reducing the load of the drone, saving hardware cost, and reducing the drone's weight. Second, to address radar-communication resource interference and resource planning, the same detection signal waveform is designed to complete both the communication and radar functions, and unified planning of radar-communication resources is carried out based on reinforcement learning, improving adaptability to dynamic, complex scenes. Third, when planning radar-communication resources, the speed and direction of each drone in the cluster are controlled in real time; by designing a multi-agent policy oriented to search with incomplete information, the flight trajectory of each drone is controlled, collisions between drones and flight out of the detection area are avoided, and adaptability when searching unknown environments is ensured. Fourth, to address the problem that, with multiple targets awaiting detection in a given environment, only some targets are detected while unknown targets at distant edges are hard to detect, a geographic fairness index is proposed to measure the fairness of target detection, and maximizing this index ensures that all targets can be detected.
Unlike existing vision-based detection methods, the invention uses radar to detect targets, which avoids the sensitivity of ordinary visual detection to environmental conditions. Meanwhile, the integrated radar-communication technology assists the detection process, so a drone can complete both radar detection and communication functions while carrying only one device; multi-agent deep reinforcement learning adjusts the flight parameters of each drone and allocates different resources to the radar and communication functions for efficient target detection.
Compared with the prior art, the invention has the remarkable advantages that: (1) dynamic environment detection under the integrated assistance of radar communication is considered, and the maneuverability and flexibility of the unmanned aerial vehicle are fully exerted; (2) the detection strategy is learned by using a deep learning technology, so that the method can be applied to large-scale complex detection tasks; (3) and multi-agent reinforcement learning is designed to drive cooperative detection among the unmanned aerial vehicles, so that a plurality of unmanned aerial vehicles can efficiently complete detection tasks.
Drawings
The foregoing and/or other advantages of the invention will become further apparent from the following detailed description of the invention when taken in conjunction with the accompanying drawings.
Fig. 1 is a radar communication integrated auxiliary unmanned aerial vehicle cooperative target detection flow chart.
Fig. 2 is a schematic diagram of a multi-unmanned-aerial-vehicle cooperative detection model with radar communication integrated assistance.
FIG. 3 is a conceptual diagram of the method of the present invention.
Detailed Description
As shown in fig. 1, 2 and 3, the invention provides a radar-communication integrated unmanned aerial vehicle cooperative multi-target detection method. The scheme is based on drone trajectory control and resource control, assisted by reinforcement learning. The multi-drone cooperative detection scene is shown in fig. 3: each drone is equipped with dual-function radar-communication equipment to detect targets in a given area while maintaining communication with an information fusion center. A multi-agent deep reinforcement learning algorithm is configured in the controller of each drone; it learns from the information each agent observes in the environment and simultaneously outputs corresponding actions, and the method structure is shown in fig. 2. The whole control system, shown in fig. 1, comprises:
step 1: multi-agent collaborative process definition
The method first defines the multi-drone cooperative detection process as a Markov decision process. The process is described by a 5-tuple (O, S, A, R, P), where O refers to the observation space of each agent, S refers to the joint state space of all agents, A refers to the action space of the agents, R refers to the reward function of the agents, and P refers to the transition probability of each agent.
The observation space O contains 7 elements: the current coordinate (x_m(t), y_m(t)) of the mth agent, the distance l_m(t−1) moved at the previous time, the direction theta_m(t−1), the channel rho_m(t−1) allocated to the drone's communication function at the previous time, the communication/radar power-division factor beta_m(t−1) at the previous time, and the communication data rate R_m(t) obtained at the current time.

That is, the observation of the mth agent at time t may be represented as o_m(t) = {x_m(t), y_m(t), l_m(t−1), theta_m(t−1), rho_m(t−1), beta_m(t−1), R_m(t)}.
The action space A is defined as the mth agent's moving direction theta_m(t) at the current time, the distance l_m(t) it can move in that direction, the communication-channel allocation factor rho_m(t), and the power-division factor beta_m(t). That is, the action of the mth agent at time t is represented as a_m(t) = {theta_m(t), l_m(t), rho_m(t), beta_m(t)}.
The reward function R defines the detection reward and the penalties for erroneous behavior of all agents; the reward of the mth agent at time t is expressed in terms of: R_m(t), the communication data rate measured by the mth agent at time t; the penalties obtained when the mth drone crosses the boundary, when drones collide with each other, and when the radar cannot cover the ground; and g(t), the geographic fairness obtained at the current time, which is calculated from the detection counts, where N represents the total number of detected targets and c_n(t) represents the number of times the nth target has been detected by time t.

Here M represents the set of drones.
Step 2: initializing model parameters
The parameters of the different modules are initialized, including the parameters theta_m of the online policy network, the parameters of the historical policy network, the parameters omega_m of the evaluation network, the learning rate beta_A of the policy network, the learning rate beta_I of the evaluation network, and the discount factor gamma. Here, the parameters used by both the policy network and the evaluation network are randomly initialized. The learning rates of the policy network and the evaluation network are important parameters affecting the learning effect: an excessively small learning rate easily makes convergence very slow, while an excessively large learning rate easily makes the algorithm converge to a local optimum, so these two parameters are tuned through multiple experiments. The discount factor can be tuned in a similar way while tuning the learning rates: set a high value, such as 0.99, and decrease it by 0.01 or 0.02 each time until the algorithm converges to a large total average reward.
After all the parameters are debugged, the online learning stage can be entered.
And 3, step 3: sample collection
First, each drone needs to collect sufficient samples for training of the policy network and the evaluation network.
Each drone m first needs to determine its current position coordinates, i.e., x_m(t), y_m(t); this position can be obtained by a GPS positioning device carried on the drone.

In addition, each drone m needs to retrieve from memory the distance l_m(t−1) moved at the previous time, the movement direction theta_m(t−1) at the previous time, the communication channel rho_m(t−1) allocated at the previous time, the power-division factor beta_m(t−1) at the previous time, and the data rate R_m(t−1) at the previous time. Note that when a drone collects a sample at time 0, the previous-time sample is random; a value is typically drawn from a random-number generator over 0 to 1.
Therefore, in the sampling step, the observation information output by the mth drone is represented as:
and 4, step 4: an online policy distribution is generated. And inputting the observation vector into a deep neural network to obtain online strategy distribution, and then sampling from the strategy distribution to obtain a corresponding action vector.
The input to this step is the observation information collected in the previous step. Thus for the mth drone, the sequence of observations entered is
The observation sequence is then input into the decision neural network, which outputs the corresponding policy distribution; a Gaussian distribution is adopted to fit the policy distribution, expressed as:

pi_theta_m(o_m, a_m) = (1 / (sqrt(2π) · sigma(o_m))) · exp( −(a_m − mu(o_m))² / (2 · sigma(o_m)²) )

where mu and sigma represent the mean and standard-deviation functions.
And 5: motion sampling and execution
First from the obtained strategic distribution πθm (o, a), namely the distance l that the mth unmanned aerial vehicle needs to move at the current momentm(t) direction of required deflection θm(t) channel rho distributed for communication of mth unmanned aerial vehicle and information fusion center at current momentm(t) and power allocation factor are collectively expressed as:
the mth drone then performs the work obtained.
First of all for its communication process allocationPower of, allocated to radar processes The radar transmit power.
Select the firstA channel in whichRepresenting an upper rounding function. K denotes the total number of optional channels.
The mth drone uses the allocated channel and power resources to perform the radar detection and communication procedures.
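The power split and channel selection described above can be sketched as follows, assuming (as reconstructed from the surrounding text) that β_m(t)P goes to communication, (1−β_m(t))P to radar, and the channel index is the ceiling of ρ_m(t)K; the helper name is illustrative:

```python
import math

def allocate_resources(beta, rho, total_power, num_channels):
    """Split the total transmit power P between the communication and radar
    processes and map the continuous channel action to a channel index."""
    p_comm = beta * total_power            # communication power beta_m(t) * P
    p_radar = (1.0 - beta) * total_power   # radar power (1 - beta_m(t)) * P
    channel = max(1, math.ceil(rho * num_channels))  # ceil(rho_m(t) * K), 1-based
    return p_comm, p_radar, channel
```

For example, with β = 0.3, ρ = 0.5, P = 100 W and K = 8 channels, the drone gets 30 W for communication, 70 W for radar, and channel 4.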
For the radar detection process, the input information is the radar power allocated at the current moment, and the output is the detection fairness g(t) of the N targets; the specific process is as follows:
first, the detection range of the mth drone is estimated, expressed as:
where Φ_m(t) represents the maximum detection range of the mth drone in the tth time slot; B denotes the communication channel bandwidth of the drone; G_Tx and G_Rx respectively represent the transmit and receive antenna gains; λ represents the wavelength of the transmitted signal; σ represents the effective detection area; Γ represents the Boltzmann constant; T_0 represents the thermodynamic temperature; F and γ represent the radar noise and detection loss, respectively; Φ_Min represents the minimum signal-to-noise ratio for drone detection. Among these parameters, G_Tx, G_Rx, Γ and T_0 are fixed values; the other parameters can be measured by the radar signal processing equipment.
Only targets within the radar detection range can be detected by the drone, so the condition for the mth agent to detect the nth target is Φ_m(t) ≥ d_mn(t), where d_mn(t) represents the distance between the mth agent and the nth target at time t;
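Since the patent's exact range formula was not preserved in extraction, the estimate can be sketched with the conventional fourth-root monostatic radar range equation built from the parameters listed above (this form is an assumption):

```python
import math

def detection_range(p_radar, g_tx, g_rx, wavelength, sigma_area, bandwidth,
                    noise_f, loss, snr_min, boltzmann=1.38e-23, t0=290.0):
    """Fourth-root radar range equation assembled from the named parameters
    (G_Tx, G_Rx, lambda, sigma, Gamma, T_0, B, F, gamma, Phi_Min)."""
    numerator = p_radar * g_tx * g_rx * wavelength**2 * sigma_area
    denominator = ((4 * math.pi)**3 * boltzmann * t0 * bandwidth
                   * noise_f * loss * snr_min)
    return (numerator / denominator) ** 0.25

def can_detect(phi_m, d_mn):
    """Detection condition Phi_m(t) >= d_mn(t)."""
    return phi_m >= d_mn
```

A consequence of the fourth-root form is that range grows slowly with radar power: multiplying the power by 16 only doubles the detection range.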
then, the mth drone uses the allocated communication powerAnd channelPerforming a communication with the information fusion center, sending the radar probe channel to the information fusion center, and measuring the data rate R during the communicationm(t)。
The information fusion center counts the number of times each target has been detected according to the detection information collected from all drones, and then calculates the detection score ε_n(t) of each target at the current moment:
where c_n(t) represents the number of times the nth target has been detected by time t.
Then, the detection fairness g(t) is calculated:
where N represents the total number of targets to be detected.
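One common choice for such a fairness measure is Jain's fairness index over the per-target detection counts; the sketch below assumes that form (the patent's exact formulas for ε_n(t) and g(t) were not preserved in extraction):

```python
def detection_scores(counts):
    """Per-target share of detections, one plausible form of epsilon_n(t)."""
    total = sum(counts)
    return [c / total if total else 0.0 for c in counts]

def detection_fairness(counts):
    """Jain's fairness index over detection counts c_n(t): equals 1 when every
    target is detected equally often, approaching 1/N as one target dominates."""
    n = len(counts)
    total = sum(counts)
    if n == 0 or total == 0:
        return 0.0
    return total ** 2 / (n * sum(c * c for c in counts))
```

For instance, four targets detected once each give fairness 1.0, while four detections concentrated on one target give 0.25, pushing the drones to spread their detection effort.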
Then, the calculated detection fairness value is sent to each drone.
Finally, each drone flies the distance l_m(t) in the assigned direction θ_m(t).
Step 6: penalty behavior detection
According to the actions obtained in step 5, penalty values are set for violating strategies, including crossing the boundary, colliding with each other, and losing radar coverage. The significance of this step is that a negative reward is set for each non-compliant policy generated by a drone; therefore, to maximize its own reward, the drone must gradually learn compliant policies until the optimal policy is found.
First, if the mth drone crosses a given boundary, a boundary crossing penalty is set, denoted as:
where Ξ_1 represents the penalty value, and X_Min, X_Max, Y_Min, Y_Max limit the range of motion of the drone.
Then, if the mth drone and the m′th drone collide with each other, a collision penalty is set, denoted as:
where Ξ_2 represents the penalty value; d_mm′(t) represents the distance between the mth drone and the m′th drone; D_S defines the safe distance between any two drones.
Then, if the radar of the mth drone cannot cover the ground, the penalty obtained is:
where Ξ_3 represents the penalty value and H represents the flying height of the drone.
The values Ξ_1, Ξ_2 and Ξ_3 are set relative to the drone's reward and should not be too small; they may, for example, be set to 0.1 times the total reward (e.g., if the total reward is 100, the penalty value may be set to 10).
The final reward obtained by each drone is then calculated by accounting for the penalties it has incurred.
After the action at the current moment is completed, each drone observes the state at the start of the next time slot.
It is then checked whether the mth drone has exhibited any of the three penalized behaviors of crossing the boundary, colliding, or losing radar coverage; if any of them has occurred, the state at the next moment is rolled back to the current state.
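The three penalty checks of step 6 and the resulting per-drone reward can be sketched as follows; the additive combination and the penalty magnitudes are illustrative assumptions:

```python
def step_penalties(x, y, other_positions, radar_range, height,
                   x_min, x_max, y_min, y_max, d_safe,
                   xi1=10.0, xi2=10.0, xi3=10.0):
    """Sum the three penalty terms of step 6: boundary crossing (Xi_1),
    collision (Xi_2), and radar coverage loss (Xi_3)."""
    penalty = 0.0
    if not (x_min <= x <= x_max and y_min <= y <= y_max):
        penalty += xi1                              # out-of-bounds penalty
    for xo, yo in other_positions:
        if ((x - xo) ** 2 + (y - yo) ** 2) ** 0.5 < d_safe:
            penalty += xi2                          # collision penalty
            break
    if radar_range < height:                        # radar can't reach the ground
        penalty += xi3
    return penalty

def final_reward(data_rate, fairness, penalty):
    """Combine data rate, detection fairness, and penalties into one reward."""
    return data_rate + fairness - penalty
```

With penalties of this magnitude, a single violation outweighs typical per-step gains, which is what steers the policy away from non-compliant actions.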
Step 7: generate joint state information
The input to this step is the observation information, action information, and obtained reward of each drone; the output is one batch of data.
Each drone sends its respective state information to the information fusion center; the information fusion center integrates all the observation information and sends the state information of the current moment back to each drone.
Steps 2 to 7 are continuously repeated for each drone until the jth batch, containing N_B samples in total, is obtained; the observation information, state information, and action information are denoted B_{s,j}, and the jth batch of rewards is denoted correspondingly. The larger N_B is, the better the convergence effect, because a larger batch size means that more data is used for training; however, it should not exceed the number of steps in one training round (episode), and it can be adjusted by starting from a smaller value and gradually increasing it.
Step 8: network parameter update
This step is used to update the parameters of the policy network and the evaluation network, i.e. θ_m and ω_m. The input is the batch data obtained in step 7, and the output is the trained network parameters.
The parameter updating of the policy network is divided into updating of an online policy network and updating of a historical policy network.
The parameters of the historical policy network are updated first. This network is mainly used to store the parameters of the existing online network and does not participate in the training process; therefore, the parameters of the existing online network are directly copied to the historical policy network, represented as:
Here, the copied parameters represent the historical policy, which is mainly used to reuse the historical experience collected by each agent so as to improve the sampling efficiency of each agent.
where L_A(θ_m) = J(θ_m) + f_E(θ_m) represents the loss function of the policy network, and ∇ represents the gradient.
J(θ_m) represents the objective function of the mth agent, set as the expected reward of each agent, where θ_m represents the parameters of the online policy network in the mth agent and x(θ_m) represents the probability ratio between the current policy and the historical policy; the function f_CL is used to restrict x(θ_m) to [1−ε, 1+ε], expressed as
where ε represents the clipping parameter, generally taken as 0.2;
The advantage function is used to evaluate the policy obtained by each agent and is expressed in terms of the value function of the evaluation network in the mth agent.
f_E(θ_m) represents a state entropy function used to enhance the agent's exploratory behavior in the environment, defined via the entropy function of the online policy π.
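The policy loss L_A(θ_m) = J(θ_m) + f_E(θ_m) with the clipping function f_CL can be sketched in NumPy as a PPO-style clipped surrogate; the entropy coefficient below is an illustrative assumption:

```python
import numpy as np

def ppo_actor_loss(logp_new, logp_old, advantages, entropy,
                   eps=0.2, ent_coef=0.01):
    """Clipped surrogate loss -(J(theta) + f_E(theta)), to be minimised.

    ratio   -- x(theta), probability ratio of current to historical policy
    np.clip -- realises f_CL, restricting the ratio to [1 - eps, 1 + eps]
    entropy -- per-sample policy entropy, weighted by an assumed coefficient
    """
    ratio = np.exp(logp_new - logp_old)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    surrogate = np.minimum(ratio * advantages, clipped * advantages)
    return -(surrogate.mean() + ent_coef * entropy.mean())
```

Taking the element-wise minimum of the unclipped and clipped terms is what prevents the online policy from moving too far from the stored historical policy in a single update.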
Steps 1 to 8 are repeated; when all targets have been detected or one training round has finished, a new round of training is started, until all drones have completed all rounds of training.
Examples
In the method of the invention, a detection range for drone detection is first defined. Each drone obtains its current coordinates in real time through the GPS positioning device fitted to it; when the coordinates exceed the detection range at a certain moment, the algorithm adjusts the drone's learned behavior to prevent it from crossing the boundary.
A collaborative process between the multiple drones is then defined using a Markov model. The detectable area is set to 2000 m × 2000 m, the number of drones M is set to 10, the number of targets to be detected is 100, the maximum number of time steps T from the start of detection to its end is set to 200, and the duration of each step is 5 minutes. In addition, the farthest distance and the maximum angle of flight within one time step are set for each drone, where the farthest distance l is set to 20 m and the maximum angle θ is set to 360 degrees. Then, each drone first obtains environment information, including the coordinate information of the current time and the moving distance, moving direction, power allocation factor, and data rate of the last time step. Note that in the 1st time step, values must be taken randomly within the approximate range of each quantity; for example, since the maximum flight distance is 20 m, the first flight distance may be 5 m. This information is then input into the multi-agent reinforcement learning algorithm to learn each drone's action in the current time step, including the distance the drone needs to fly, the angle at which it needs to fly, the allocated channel, and the power allocation factor for the current time step.
Each drone then executes the learned action and updates its learning network. Each drone obtains the flight distance l, the flight angle θ, the channel allocation, and the power allocation factor for the current time step through the learning algorithm. First, each drone detects whether targets exist around it through its radar communication integrated equipment, with the detection range determined by the power allocated to the radar function; then each drone sends the obtained radar detection information to the control center through its allocated channel, and the control center, after summarizing the information of all drones, sends all the information back to each drone. Each drone then uses this information to calculate the return obtained for this learned action; this return includes the measured communication data rate, the fairness of all target detections, whether the drone has collided or crossed the boundary, and whether the radar fails to cover the ground (note that this failure is caused by allocating too little power to the radar). Each drone then updates its own learning network according to the calculated return information, and finally each drone flies l meters at flight angle θ. Through this process, each drone learns continuously in the environment and can finally learn a stable policy, which is the learned drone cooperative multi-target detection method.
The invention provides a radar communication integrated unmanned aerial vehicle cooperative multi-target detection method, and there are many methods and ways to specifically implement the technical scheme. The above description is only a preferred embodiment of the invention; it should be noted that, for a person skilled in the art, several improvements and embellishments can be made without departing from the principle of the invention, and these improvements and embellishments should also be regarded as falling within the protection scope of the invention. All components not specified in the present embodiment can be realized by the prior art.
Claims (7)
1. A radar communication integrated unmanned aerial vehicle cooperative multi-target detection method, characterized by comprising the following steps:
step 1, modeling an unmanned aerial vehicle cooperative multi-target detection problem;
and 2, designing a multi-agent cooperative detection scheme.
2. The method of claim 1, wherein step 1 comprises:
step 1-1, defining a problem;
step 1-2, designing flight path constraints of the unmanned aerial vehicle;
step 1-3, designing resource allocation under the integration of radar communication of the unmanned aerial vehicle;
step 1-4, measuring the performance of radar and communication of the unmanned aerial vehicle;
step 1-5, carrying out multi-unmanned aerial vehicle cooperative detection reinforcement learning modeling;
and 1-6, designing a strategy learning module and a strategy evaluation module.
3. The method of claim 2, wherein step 1-1 comprises: each drone is set as an agent, and all agents cooperate to complete the detection task of the area; each drone sends the information obtained by detection to the control center in real time through a communication link; the total detection time is T; and within the detection time, the data rate between the drones and the control center and the detection performance are to be maximized by allocating radar and communication resources and the trajectories of the drones within the given area, wherein the detection performance is expressed by the detection fairness of all targets.
4. The method of claim 3, wherein steps 1-2 comprise: dividing the whole detection time into S time slots, wherein the duration of each time slot is tau;
in each flight interval, each drone can fly in a direction θ_m(t) ∈ [0, 2π) for a distance l_m(t) ∈ [0, l_Max], where l_Max represents the maximum distance a drone can fly during the time τ, this distance being determined by the model of the drone; for an agent departing from coordinate [x_m(0), y_m(0)], the movement within time t is represented as:
where l_m(t) represents the actual moving distance of the mth drone in the tth time slot, and θ_m(t′) represents the flight direction of the mth drone during the t′th time slot;
set that the unmanned aerial vehicle can only be in [ X ]Min,XMax]×[YMin,YMax]Thus, there are:
XMin≤xm(t)≤XMax
YMin≤ym(t)≤YMax
where X_Min, X_Max, Y_Min, Y_Max respectively represent the minimum and maximum values of the drone's movement coordinate on the x axis and the minimum and maximum values on the y axis;
set for safe distance between the unmanned aerial vehicle, show as:
dmm′(t)≥DS
wherein d ismm′(t) denotes the mth drone to mth' drone in the tth slotDistance of the drone; dSRepresenting a safe distance between any two drones.
5. The method of claim 4, wherein steps 1-3 comprise: the resources allocated to each drone's radar and communication processes are transmit power and channel:
for a given total transmit power P, a power allocation factor is used to allocate corresponding power to the radar detection and communication functions; β_m(t)P represents the communication power allocated to the mth drone at time t, (1−β_m(t))P represents the radar transmit power allocated to the mth drone at time t, and β_m(t) represents the power allocation factor of the mth agent at time t;
for a total of K channels, ρ_mk(t) denotes whether the kth channel is selected at time t: ρ_mk(t) = 1 indicates that the mth agent selects the kth channel, and ρ_mk(t) = 0 indicates that the mth agent does not select the kth channel.
6. The method of claim 5, wherein steps 1-4 comprise:
according to the power allocated to the mth drone at time t, the detection range of each agent is estimated using the following radar equation:
wherein B represents the drone communication channel bandwidth; Φ_m(t) represents the farthest distance the mth drone can probe in the tth time slot; G_Tx and G_Rx respectively represent the transmit antenna gain and the receive antenna gain; λ represents the wavelength of the transmitted signal; σ represents the effective detection area; Γ represents the Boltzmann constant; T_0 represents the thermodynamic temperature; F and γ represent the radar noise and detection loss, respectively; Φ_Min represents the minimum signal-to-noise ratio for drone detection;
the condition for the mth agent to detect the nth target is defined as: Φ_m(t) ≥ d_mn(t), where d_mn(t) represents the distance between the mth agent and the nth target at time t;
the detection score ε_n(t) is defined as:
where c_n(t) represents the number of times the nth target has been detected by time t;
the fairness g(t) of target detection is defined as:
where N represents the total number of targets to be detected.
7. The method of claim 6, wherein steps 1-5 comprise: a 5-tuple ⟨O, S, A, R, P⟩ is used to describe the decision process, wherein O refers to the observation space of each agent, S refers to the joint state space of all agents, A refers to the action space of the agents, R refers to the reward function of the agents, and P refers to the state transition probability of each agent;
the observation space is defined as the current coordinates (x_m(t), y_m(t)) of the mth agent, the distance l_m(t−1) moved at the previous moment, the direction θ_m(t−1) at the previous moment, the channel ρ_m(t−1) allocated to the drone's communication function at the previous moment, the communication and radar power allocation factor β_m(t−1) at the previous moment, and the communication data rate R_m(t−1) obtained at the previous moment;
the action space is defined as the moving direction θ_m(t) of the mth agent at the current moment, the distance l_m(t) movable in this direction, the communication channel allocation factor ρ_m(t), and the power allocation factor β_m(t);
the reward function defines the detection reward of all agents and the punishment of erroneous behaviors, wherein R_m(t) represents the communication data rate measured by the mth agent at time t, and the three penalty terms respectively represent the penalty obtained when the mth drone crosses the boundary, the penalty obtained when drones collide with each other, and the penalty obtained when the radar cannot cover the ground.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210336444.7A CN114679729B (en) | 2022-03-31 | Unmanned aerial vehicle cooperative multi-target detection method integrating radar communication |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114679729A true CN114679729A (en) | 2022-06-28 |
CN114679729B CN114679729B (en) | 2024-04-30 |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115877868A (en) * | 2022-12-01 | 2023-03-31 | 南京航空航天大学 | Path planning method for unmanned aerial vehicle to resist malicious interference in data collection of Internet of things |
CN116482673A (en) * | 2023-04-27 | 2023-07-25 | 电子科技大学 | Distributed radar detection tracking integrated waveform implementation method based on reinforcement learning |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020230137A1 (en) * | 2019-05-16 | 2020-11-19 | B.G. Negev Technologies And Applications Ltd., At Ben-Gurion University | System and method for automated multi-objective policy implementation, using reinforcement learning |
CN113207128A (en) * | 2021-05-07 | 2021-08-03 | 东南大学 | Unmanned aerial vehicle cluster radar communication integrated resource allocation method under reinforcement learning |
CN114142908A (en) * | 2021-09-17 | 2022-03-04 | 北京航空航天大学 | Multi-unmanned aerial vehicle communication resource allocation method for coverage reconnaissance task |
Non-Patent Citations (3)
Title |
---|
M. SCHERHAUF等: "Radar distance measurement with Viterbi algorithm to resolve phase ambiguity", 《IEEE TRANS. MICROW. THEORY TECHN》, vol. 68, no. 9, 31 December 2020 (2020-12-31), pages 3784 - 3793, XP011807061, DOI: 10.1109/TMTT.2020.2985357 * |
揭东;汤新民;李博;顾俊伟;戴峥;张阳;刘岩;: "无人机冲突探测及解脱策略关键技术研究", 武汉理工大学学报(交通科学与工程版), no. 05, 15 October 2018 (2018-10-15) * |
王超;马驰;常俊杰;: "基于改进小波神经网络的协同作战能力评估", 指挥信息系统与技术, no. 01, 28 February 2020 (2020-02-28) * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant |