CN112256056B - Unmanned aerial vehicle control method and system based on multi-agent deep reinforcement learning - Google Patents

Unmanned aerial vehicle control method and system based on multi-agent deep reinforcement learning Download PDF

Info

Publication number
CN112256056B
CN112256056B (application number CN202011118496.4A)
Authority
CN
China
Prior art keywords
unmanned aerial
aerial vehicle
value
network
information acquisition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011118496.4A
Other languages
Chinese (zh)
Other versions
CN112256056A (en)
Inventor
陈武辉
杨志华
郑子彬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN202011118496.4A priority Critical patent/CN112256056B/en
Publication of CN112256056A publication Critical patent/CN112256056A/en
Application granted granted Critical
Publication of CN112256056B publication Critical patent/CN112256056B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G05 - CONTROLLING; REGULATING
    • G05D - SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00 - Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/10 - Simultaneous control of position or course in three dimensions
    • G05D1/101 - Simultaneous control of position or course in three dimensions specially adapted for aircraft
    • G05D1/104 - Simultaneous control of position or course in three dimensions specially adapted for aircraft involving a plurality of aircrafts, e.g. formation flying
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods


Abstract

The invention provides an unmanned aerial vehicle control method and system based on multi-agent deep reinforcement learning, wherein the method comprises the following steps: establishing an information acquisition task model according to parameters of the unmanned aerial vehicle group information acquisition system, the information acquisition task being divided into an acquisition subtask and a calculation subtask; constructing a deep neural network model according to the task model, and training the deep neural network model with a multi-agent deep reinforcement learning algorithm combined with an attention mechanism; and controlling the unmanned aerial vehicle group in the actual environment to complete the information acquisition task with the trained deep neural network model. In the invention, each unmanned aerial vehicle serves as an agent, and the performance of the actor network is evaluated by a critic network with an attention unit, so that more accurate evaluation values accelerate the training of the actor network; when the information acquisition task is executed, each unmanned aerial vehicle does not need to communicate with the other agents, so that the communication time delay is reduced.

Description

Unmanned aerial vehicle control method and system based on multi-agent deep reinforcement learning
Technical Field
The invention relates to the technical field of wireless communication, in particular to an unmanned aerial vehicle control method and system based on multi-agent deep reinforcement learning.
Background
An Unmanned Aerial Vehicle (UAV) is an unmanned aircraft that is remotely controlled by an operator via a radio remote control device or automatically controlled by a computer program. The majority of unmanned aerial vehicle applications are information acquisition tasks, and in the prior art the control instructions for the data acquisition tasks of a multi-unmanned-aerial-vehicle system are obtained mainly by two kinds of methods, namely heuristic methods and methods based on machine learning.
A heuristic algorithm needs multiple rounds of calculation after receiving the tasks before the optimal information acquisition and computation migration scheme is obtained, which generates a large time delay and is unfavorable for urgent tasks; a single-agent deep reinforcement learning algorithm needs to acquire the states of all unmanned aerial vehicles through communication after receiving a task, which generates a certain time delay, and meanwhile, as the number of unmanned aerial vehicles increases, the number of training iterations required for a single deep neural network to converge also increases greatly, so the obtained strategy has difficulty achieving good energy consumption and time consumption.
Therefore, it is difficult for the drone system to make appropriate strategies within a short time delay when confronted with various complex tasks and environments.
Disclosure of Invention
The invention aims to provide an unmanned aerial vehicle control method and system based on multi-agent deep reinforcement learning, and aims to solve the technical problem that an unmanned aerial vehicle system is difficult to make a proper strategy within a short time delay when facing various complex tasks and environments.
The purpose of the invention can be realized by the following technical scheme:
an unmanned aerial vehicle control method based on multi-agent deep reinforcement learning comprises the following steps:
establishing an information acquisition task model according to parameters of the unmanned aerial vehicle group information acquisition system; the information acquisition task is divided into an acquisition subtask and a calculation subtask;
constructing a deep neural network model according to the task model, and training the deep neural network model by utilizing a multi-agent deep reinforcement learning algorithm combined with an attention mechanism; wherein the agent is an unmanned aerial vehicle;
and controlling the unmanned aerial vehicle group in the actual environment to complete an information acquisition task by using the trained deep neural network model.
Optionally, before constructing the deep neural network model according to the task model, the method further includes:
the parameters of the unmanned aerial vehicle group information acquisition system are converted into a state space of the system and an action space of the intelligent agent, and an instant reward function is set.
Optionally, the deep neural network model specifically comprises an actor network and a critic network, wherein the actor network comprises an estimated actor network and a target actor network, the critic network comprises an estimated critic network and a target critic network, and an attention unit is embedded in the critic network on top of its three fully connected layers.
Optionally, the method further comprises: when an actor network is trained, a critic network with an attention unit is used for evaluating the performance of the actor network, and the specific process is as follows:
firstly, the number of unmanned aerial vehicles in the unmanned aerial vehicle cluster is N, and the observed value o_i and the action value a_i of unmanned aerial vehicle i (1 ≤ i ≤ N) are input into a single fully connected layer to obtain the state-action feature value g(o_i, a_i) of each unmanned aerial vehicle, and the state-action feature values of all unmanned aerial vehicles are input into the attention unit;
the attention unit calculates the attention weight α_j of unmanned aerial vehicle j according to the feature value of unmanned aerial vehicle i and the feature values of the other unmanned aerial vehicles j (j ≠ i):

$$\alpha_j = \mathrm{softmax}\!\left(\frac{\big(W_k\, g(o_j, a_j)\big)^{\mathsf T}\, W_q\, g(o_i, a_i)}{\sqrt{d}}\right)$$

wherein W_k and W_q are learnable attention parameter matrices and d is the dimension of the state-action feature values;
the influence value e_i of the other unmanned aerial vehicles on unmanned aerial vehicle i is calculated in a weighted-sum manner according to the attention weights and the feature values of the other unmanned aerial vehicles:

$$e_i = \sum_{j \ne i} \alpha_j\, h\!\big(W_o\, g(o_j, a_j)\big)$$

wherein W_o is a learnable attention parameter matrix and h is a dot-product operation;
the state-action feature value g(o_i, a_i) of unmanned aerial vehicle i and the influence value e_i are input into a two-layer fully connected network to obtain the action-state value Q_i of the unmanned aerial vehicle.
Optionally, the training of the deep neural network model by using a multi-agent deep reinforcement learning algorithm with an attention mechanism specifically includes:
s201: randomly initializing a system state and a neural network parameter;
s202: acquiring an observation value X = [o_1, o_2, …, o_M] of the current time slot of each unmanned aerial vehicle according to the system state and the observation range of the unmanned aerial vehicle; wherein M is the number of unmanned aerial vehicles in the unmanned aerial vehicle cluster;
s203: inputting the observed value o_i of each unmanned aerial vehicle into the corresponding actor network to obtain the action value a_i corresponding to each unmanned aerial vehicle; wherein 1 ≤ i ≤ M;
s204: obtaining the rewards R = [r_1, r_2, …, r_M] of all unmanned aerial vehicles, the system state S′ of the next time slot and the observed values X′ = [o′_1, o′_2, …, o′_M] according to the system state and the action values A = [a_1, a_2, …, a_M] of all unmanned aerial vehicles in the current time slot, and storing the experience sample (X, A, R, X′) in the experience pool of the agents;
s205: repeating S202-S204 until the number of samples in the experience pool reaches a set threshold, and extracting a certain number of experience samples from the experience pool to update the neural network parameters until the policy function of the actor network converges.
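As a non-authoritative illustration of steps S202 to S204, the sketch below shows how the per-slot experience samples (X, A, R, X′) could be collected into a shared experience pool; the environment interface (env.observe, env.step) and the actor.act method are assumptions for illustration only, not part of the patent.

```python
# A minimal sketch of experience collection and sampling, assuming a simulation
# environment `env` with observe()/step() and one actor network per drone.
import random
from collections import deque

replay_buffer = deque(maxlen=100_000)   # shared experience pool of (X, A, R, X') tuples

def collect_one_slot(env, actors, state):
    X = env.observe(state)                                # local observations [o_1, ..., o_M]
    A = [actor.act(o) for actor, o in zip(actors, X)]     # action a_i from each actor network
    next_state, R = env.step(state, A)                    # rewards [r_1, ..., r_M] and next state
    X_next = env.observe(next_state)
    replay_buffer.append((X, A, R, X_next))               # store the experience sample
    return next_state

def sample_batch(batch_size=256):
    # used once the experience pool exceeds the set threshold (step S205)
    return random.sample(replay_buffer, batch_size)
```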
Optionally, the step of controlling the unmanned aerial vehicle group in the actual environment to complete the information acquisition task by using the trained deep neural network model specifically includes:
parameterizing the state of the unmanned aerial vehicle system and the observed value of each unmanned aerial vehicle in the actual environment;
inputting the parameterized observation value of the unmanned aerial vehicle into the trained actor network to obtain the action value of the unmanned aerial vehicle;
and converting the action value into an acquisition instruction and a computation instruction, according to which the unmanned aerial vehicle performs information acquisition and computation migration.
The invention also provides an unmanned aerial vehicle control system based on multi-agent deep reinforcement learning, which comprises the following components:
the task model establishing module is used for establishing an information acquisition task model according to parameters of the unmanned aerial vehicle group information acquisition system; the information acquisition task is divided into an acquisition subtask and a calculation subtask;
the deep neural network building and training module is used for building a deep neural network model according to the task model and training the deep neural network model with a multi-agent deep reinforcement learning algorithm combined with an attention mechanism; wherein the agent is an unmanned aerial vehicle;
and the information acquisition task execution module is used for controlling the unmanned aerial vehicle group in the actual environment to complete the information acquisition task by utilizing the trained deep neural network model.
Optionally, the method further comprises:
and the system parameter conversion module is used for converting the parameters of the unmanned aerial vehicle group information acquisition system into a state space of the system and an action space of the intelligent agent and setting an instant reward function.
The present invention also provides a computer-readable storage medium for storing a computer program; wherein the computer program, when executed by a processor, implements the multi-agent deep reinforcement learning-based drone control method.
The present invention also provides an electronic device, comprising:
a memory for storing a computer program;
and the processor is used for executing the computer program to realize the unmanned aerial vehicle control method based on multi-agent deep reinforcement learning.
The invention provides an unmanned aerial vehicle control method and system based on multi-agent deep reinforcement learning, wherein the method comprises the following steps: establishing an information acquisition task model according to parameters of the unmanned aerial vehicle group information acquisition system; the information acquisition task is divided into an acquisition subtask and a calculation subtask; constructing a deep neural network model according to the task model, and training the deep neural network model by utilizing a multi-agent deep reinforcement learning algorithm combined with an attention mechanism; wherein the agent is an unmanned aerial vehicle; and controlling the unmanned aerial vehicle group in the actual environment to complete an information acquisition task by using the trained deep neural network model.
In the invention, each unmanned aerial vehicle is regarded as an agent, and multi-agent deep reinforcement learning only needs each agent to interact with the environment to obtain reward values, through which the agents continuously learn and improve their strategies; the state information of the whole system does not need to be acquired through communication when the agents make decisions, thereby avoiding the communication time delay. When the actor network is trained, a critic network with an attention unit is used to evaluate the performance of the actor network, so that the influence of other agents with higher similarity can be better noticed during evaluation and more accurate evaluation values are obtained to guide the training of the actor network, thereby accelerating its training. When the information acquisition task is executed, each unmanned aerial vehicle only needs to input its own observed value directly into the trained actor network to obtain the control instruction for the task period, which avoids the need of a single-agent deep reinforcement learning algorithm to collect the states and observed values of all unmanned aerial vehicles through communication before formulating control instructions, thereby reducing the reaction time delay.
Drawings
FIG. 1 is a schematic diagram of a neural network training framework of the multi-agent deep reinforcement learning-based unmanned aerial vehicle control method and system of the present invention;
FIG. 2 is a method flow diagram of the multi-agent deep reinforcement learning-based unmanned aerial vehicle control method and system of the present invention;
fig. 3 is a flowchart of a method of an embodiment of the multi-agent deep reinforcement learning-based drone control method and system of the present invention.
Detailed Description
Interpretation of terms:
compute offload (computing offload) is the transfer of resource-intensive computing tasks onto a separate processor (e.g., hardware accelerator) or an external platform (e.g., cloud server, edge server). Offloading to a coprocessor may be used to accelerate applications, including image rendering and mathematical computations. Offloading the computing to an external platform over a network may provide computing power and overcome hardware limitations of the device, such as limited computing power, storage, and energy.
Multi-agent deep reinforcement learning (Multi-agent deep reinforcement learning): in a multi-agent system, each agent learns to improve its own policy by interacting with the environment to obtain reward values (rewards), thereby obtaining the optimal policy in the environment.
Attention mechanism (Attention mechanism): the attention mechanism in deep learning is similar to the human selective mechanism in nature, and the core target is to select information which is more critical to the current task target from a plurality of information. Currently, attention mechanism has been widely used in various deep learning tasks such as natural language processing, image recognition and speech recognition, and is one of the most important core technologies in deep learning technology.
The embodiment of the invention provides an unmanned aerial vehicle control method and system based on multi-agent deep reinforcement learning, and aims to solve the technical problem that an unmanned aerial vehicle system is difficult to make a proper strategy within a short time delay when facing various complex tasks and environments.
To facilitate an understanding of the invention, the invention will now be described more fully with reference to the accompanying drawings. Preferred embodiments of the present invention are shown in the drawings. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
The unmanned aerial vehicle is mainly applied to the military field at the beginning of birth and is used for replacing a common manned aircraft to perform 'dull' or 'dangerous' tasks, such as intelligence reconnaissance, ammunition release and the like. With the improvement of the manufacturing technology of unmanned aerial vehicles and the emergence of unmanned aerial vehicles with various functions in recent years, the application range of unmanned aerial vehicles has been expanded to a plurality of civil fields, such as terrain exploration, traffic road condition monitoring, scenic spot aerial photography, natural disaster observation and the like. And as the complexity of the application is gradually increased, the unmanned aerial vehicle cluster is cooperated to gradually replace a single unmanned aerial vehicle so as to improve the efficiency of the system. For a single unmanned aerial vehicle, the most common control mode is manual remote control, but for a multi-unmanned aerial vehicle system, a large amount of manpower is consumed for configuring one controller for each unmanned aerial vehicle to control, and therefore the industry often uses a computer program to perform automatic control. For example, in the flight performance of the unmanned aerial vehicle cluster, each unmanned aerial vehicle is controlled by a preset program. However, in a complex and variable environment, a preset program cannot give a proper instruction to the unmanned aerial vehicle according to specific conditions. Therefore, a method for making different flight control commands according to different specific environmental conditions is needed.
The majority of the applications of the unmanned aerial vehicle can be regarded as information acquisition tasks, and the information data of the earth surface is acquired by utilizing devices such as a high-definition camera and an infrared sensor which are equipped by the unmanned aerial vehicle. Meanwhile, the data result required by the user is not only the original collected data such as a photo, but also a result obtained by calculating the collected original data to a certain degree. For example, for terrain exploration, the user-desired result is often a 3D terrain map rendered from the acquired data; for traffic road condition monitoring, the result required by the user is often road condition data such as traffic flow calculated according to the shot picture. Therefore, the information collection task of the unmanned aerial vehicle can be divided into an acquisition subtask and a calculation subtask. With the development of chip technology, the chip carried on the unmanned aerial vehicle can already complete a certain calculation task, but due to the limitations of electric quantity, time and the like, the unmanned aerial vehicle is difficult to independently complete all calculation tasks. In order to solve the problem, part of the computing tasks of the unmanned aerial vehicle can be calculated and migrated, namely, part of the computing tasks of the unmanned aerial vehicle are uploaded to the cloud server or the edge server, and the computing tasks are rapidly completed by the aid of the cloud server and the edge server which are higher in computing capacity. When calculation migration is carried out, the unmanned aerial vehicle system needs to pay for consumed server resources, so that the control program of the unmanned aerial vehicle information acquisition system needs to make a flight control instruction and also needs to make a calculation migration control instruction according to the balance of time and cost.
From the perspective of the unmanned aerial vehicle system, the aim is to minimize the energy consumption of the system and the processing time of the tasks. The unmanned aerial vehicle system therefore needs to adjust its own control commands according to the actual task state and the environmental state (e.g., the state of the servers), so as to achieve the optimal energy consumption and task completion time; such a problem can be regarded as a joint optimization problem. In existing research, the control instructions for the data acquisition tasks of a multi-unmanned-aerial-vehicle system are obtained mainly by two kinds of methods, namely heuristic methods and methods based on machine learning. In a heuristic algorithm, the multi-unmanned-aerial-vehicle information acquisition task is modeled as an NP-hard combinatorial optimization problem, and the optimal information acquisition strategy is obtained after multiple rounds of calculation on multiple randomly generated candidate solutions, using algorithms such as the genetic algorithm, the particle swarm algorithm and simulated annealing. A traditional heuristic algorithm needs multiple rounds of calculation after receiving the tasks before the information acquisition and computation migration control instructions are obtained, which generates a larger time delay and is unfavorable for the execution of some urgent tasks. Deep reinforcement learning, as one of the machine learning methods, can train a deep neural network to serve as a policy function: the system state of each time slot is input into the neural network and the specific actions of the unmanned aerial vehicles are output, which helps the unmanned aerial vehicle system make appropriate flight and computation migration decisions. However, current research adopts deep reinforcement learning methods based on a single agent, in which the whole system is regarded as one agent and the flight and computation migration strategies of all unmanned aerial vehicles in the system are formulated uniformly by a centralized policy network. This requires the unmanned aerial vehicle system to collect the states of all unmanned aerial vehicles in every time slot, resulting in a certain communication delay. Moreover, as the number of unmanned aerial vehicles increases and the environment becomes more complex, single-agent deep reinforcement learning has difficulty obtaining the optimal strategy or fails to converge. In view of these problems, multi-agent deep reinforcement learning only needs each agent to interact with the environment to obtain reward values, through which each agent continuously learns and improves its strategy, and the agent does not need to obtain the global state information of the system through communication when making a decision, so the communication delay is avoided.
Referring to fig. 1 to 3, the following provides a method for controlling an unmanned aerial vehicle based on multi-agent deep reinforcement learning, including:
s101: establishing an information acquisition task model according to parameters of the unmanned aerial vehicle group information acquisition system; the information acquisition task is divided into an acquisition subtask and a calculation subtask;
s102: constructing a deep neural network model according to the task model, and training the deep neural network model by utilizing a multi-agent deep reinforcement learning algorithm combined with an attention mechanism; wherein the agent is an unmanned aerial vehicle;
s103: and controlling the unmanned aerial vehicle group in the actual environment to complete an information acquisition task by using the trained deep neural network model.
In this embodiment, there are M unmanned aerial vehicles in the unmanned aerial vehicle system and K edge servers that the unmanned aerial vehicles can access. Meanwhile, time is discretized into equal-length time slots τ, and in each time slot τ the system needs to perform N information acquisition tasks. Before using the multi-agent deep reinforcement learning algorithm, the system model needs to be parameterized into a system state space and an agent action space, and an instant reward function needs to be set.
The specific process of parameterizing the system state space in this embodiment is as follows:
in each time slot, the total state of the system comprises the states of the N information acquisition tasks generated by the system, the states of the K edge servers and the states of the M unmanned aerial vehicles, which are respectively defined as follows.

Let s_j^{task} = [x_j^{task}, y_j^{task}, b_j] denote the state of the j-th information acquisition task in the current time slot, wherein x_j^{task} is the abscissa of the position of the j-th acquisition task, y_j^{task} is the ordinate of the position of the j-th acquisition task, and b_j is the size of the data that needs to be collected for the j-th task.

Let s_k^{edge} = [f_k, B_k] denote the state of the k-th edge server in the current time slot, wherein f_k is the computation rate of the k-th edge server and B_k is the uplink bandwidth of the k-th edge server.

Let s_i^{uav} = [x_i^{uav}, y_i^{uav}] denote the state of the i-th unmanned aerial vehicle in the current time slot, wherein x_i^{uav} is the abscissa and y_i^{uav} the ordinate of its current position.

It is worth explaining that, in the multi-agent unmanned aerial vehicle system, an unmanned aerial vehicle does not need to communicate with the other unmanned aerial vehicles when deciding its specific action for the current time slot, so each unmanned aerial vehicle cannot obtain the full state of the current system; instead, it obtains a local observation based on the total system state and its own observation range, o_i = [T_i, U_i, E], wherein T_i is the set of states of all information points within the observation range of unmanned aerial vehicle i in the current time slot, U_i is the set of states of all other unmanned aerial vehicles within the observation range of unmanned aerial vehicle i in the current time slot, and E is the set of states of all edge servers in the system.
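A minimal sketch of how the parameterized states and the local observation o_i = [T_i, U_i, E] described above could be represented; all class and field names below are illustrative assumptions, not taken from the patent.

```python
# Illustrative data structures for the task, edge-server and drone states and
# for assembling drone i's local observation from its observation range.
from dataclasses import dataclass

@dataclass
class TaskState:
    x: float; y: float; data_size: float         # position of the task and data volume b_j

@dataclass
class EdgeState:
    compute_rate: float; uplink_bw: float         # f_k and B_k of an edge server

@dataclass
class UavState:
    x: float; y: float                            # current position of a drone

def local_observation(i, uavs, tasks, edges, obs_range):
    me = uavs[i]
    in_range = lambda x, y: (x - me.x) ** 2 + (y - me.y) ** 2 <= obs_range ** 2
    T_i = [t for t in tasks if in_range(t.x, t.y)]                       # visible info points
    U_i = [u for j, u in enumerate(uavs) if j != i and in_range(u.x, u.y)]
    return T_i, U_i, edges                        # E: states of all edge servers are known
```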
The specific process of parameterizing the action space of the agent in this embodiment is as follows:
Given the observation o_i it obtains in each time slot, each unmanned aerial vehicle needs to derive its action from its own policy function. The action output by the policy function of the i-th unmanned aerial vehicle in time slot τ is defined as a_i = [d_i, θ_i, ρ_i, z_i], wherein d_i is the distance the i-th unmanned aerial vehicle flies, θ_i is the angle at which the i-th unmanned aerial vehicle flies in time slot τ, ρ_i is the computation-migration ratio of the i-th unmanned aerial vehicle, and z_i is the number of the edge server accessed by the i-th unmanned aerial vehicle.
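A minimal sketch of decoding a raw actor output into the action a_i = [d_i, θ_i, ρ_i, z_i]; it assumes the actor emits four values in [-1, 1] (e.g., from a tanh output layer), and the scaling ranges are illustrative assumptions, not values from the patent.

```python
import math

def decode_action(raw, d_max, num_servers):
    # raw: four values in [-1, 1] from the actor network's output layer (assumption)
    d_i = (raw[0] + 1) / 2 * d_max                       # flight distance in [0, d_max]
    theta_i = (raw[1] + 1) * math.pi                     # flight angle in [0, 2*pi)
    rho_i = (raw[2] + 1) / 2                             # computation-migration ratio in [0, 1]
    z_i = min(int((raw[3] + 1) / 2 * num_servers), num_servers - 1)   # edge server index
    return d_i, theta_i, rho_i, z_i
```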
The specific process of setting the reward function in this embodiment is as follows:
for the unmanned aerial vehicle information acquisition system, the goal of unmanned aerial vehicle i is to maximize its own income. After unmanned aerial vehicle i finishes an information acquisition task it obtains a certain benefit, while the energy consumption and the time cost incurred in completing the acquisition task must also be considered. The reward of a single unmanned aerial vehicle in a time slot is therefore defined from G_i, the benefit unmanned aerial vehicle i obtains from the acquisition tasks it completes in the time slot, T_i^{cost}, the time cost of unmanned aerial vehicle i in the time slot, and E_i^{cost}, the energy consumption of unmanned aerial vehicle i during the time slot:

$$r_i = G_i - T_i^{cost} - E_i^{cost}$$

It is worth noting that the benefit of a task is related to the data size of the task and is defined as

$$G_i = \sum_{j=1}^{N} \beta_{ij}\, g\, b_j$$

wherein β_{ij} = 1 indicates that unmanned aerial vehicle i has performed information acquisition on information point j, β_{ij} = 0 indicates that unmanned aerial vehicle i has not performed information acquisition on information point j, b_j is the total amount of data of the j-th task, and g is the benefit per unit of completed data.

The time cost T_i^{cost} is mainly obtained by adding the flight time of the unmanned aerial vehicle, the time for acquiring information and the time for performing data computation:

$$T_i^{cost} = t_i^{fly} + t_i^{col} + t_i^{comp}, \qquad t_i^{fly} = \frac{d_i}{v_i}$$

wherein d_i is the flight distance of the i-th unmanned aerial vehicle and v_i is its flight speed; the information acquisition time t_i^{col} is determined by the information acquisition rate of the i-th unmanned aerial vehicle; and the computation time t_i^{comp} consists of the local computation time, determined by the computation rate of the i-th unmanned aerial vehicle and the time it spends computing task j, and the edge computation time. For the edge part, ε_i is the data uploading rate of unmanned aerial vehicle i; according to Shannon's theorem and the bandwidth B_{z_i} of the edge server z_i accessed by unmanned aerial vehicle i, it is obtained as

$$\varepsilon_i = \frac{B_{z_i}}{n_{z_i}} \log_2\!\left(1 + \mathrm{SNR}\right)$$

wherein n_{z_i} is the number of unmanned aerial vehicles accessing edge server z_i in the same time slot and SNR is the signal-to-noise ratio. The time cost of edge computing comprises the time cost of uploading the migrated data at rate ε_i and the time cost of computation at the edge server.

The energy consumption E_i^{cost} is mainly obtained by adding the flight energy consumption of the unmanned aerial vehicle, the energy consumption of information acquisition and the energy consumption of data computation, each obtained as the corresponding power multiplied by the corresponding time, wherein p_i^{fly} is the flight power of unmanned aerial vehicle i, p_i^{col} is the information acquisition power of unmanned aerial vehicle i, p_i^{loc} is the power of local computation of unmanned aerial vehicle i, p_i^{up} is the power of data uploading of unmanned aerial vehicle i, and p_{z_i}^{edge} is the computation power of the z_i-th edge server.
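A simplified, hedged sketch of the per-slot reward described above (benefit minus time cost minus energy cost). The Shannon-style uplink rate follows the text; the equal weighting of the cost terms and the omission of the uploading and edge-computing energy terms are simplifying assumptions made only for illustration.

```python
import math

def upload_rate(bandwidth, n_sharing, snr):
    # Shannon-style uplink rate when n_sharing drones share edge server z_i
    return (bandwidth / n_sharing) * math.log2(1 + snr)

def reward(collected_bits, gain_per_bit, flight_time, collect_time, compute_time,
           p_fly, p_collect, p_compute):
    G_i = gain_per_bit * collected_bits                              # benefit of completed tasks
    T_i = flight_time + collect_time + compute_time                  # time cost
    E_i = p_fly * flight_time + p_collect * collect_time + p_compute * compute_time
    return G_i - T_i - E_i                                           # relative weights assumed equal
```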
The multi-agent deep reinforcement learning algorithm combined with the attention mechanism in this embodiment is mainly divided into two parts: the first part establishes a deep neural network according to the information acquisition task model and trains it in a computer simulation environment; the second part uses the trained deep neural network to obtain the information acquisition and computation migration control instructions of the unmanned aerial vehicles in the actual environment.
In this embodiment, a multi-agent deep reinforcement learning algorithm combined with an attention mechanism is adopted to train the deep neural network model. The adopted deep reinforcement learning algorithm is based on the Actor-Critic framework, and the deep neural network is divided into an actor network and a critic network. The actor network serves as the policy function of the agent and is used to obtain the specific action of the agent; the critic network serves as the action-state value function of the agent and is used during training to evaluate the policy performance, namely the Q value, of the agent's actor network. In the neural network training stage, the actor network and the critic network need to be trained simultaneously. It should be noted that the deep reinforcement learning algorithm of this embodiment is a multi-agent deep reinforcement learning algorithm, that is, each agent has its own actor network.
The actor network comprises an estimated actor network and a target actor network. The actor network is a three-layer fully connected deep neural network whose input is the observed value o_i of unmanned aerial vehicle i and whose output is the action a_i of unmanned aerial vehicle i in the current time slot. The actor network is trained to obtain a better action policy function, which is used to obtain the corresponding optimal action for different state inputs from the actual environment.
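A minimal sketch, assuming PyTorch, of the three-layer fully connected actor network described above; the hidden width and the tanh output squashing are illustrative choices rather than values from the patent.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim), nn.Tanh(),   # actions squashed to [-1, 1]
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # obs: (batch, obs_dim) -> action: (batch, act_dim)
        return self.net(obs)
```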
The critic network also comprises an estimation critic network and a target critic network, and in order to enable the critic network to obtain more accurate estimation values, the critic network adds an attention unit on the basis of a three-layer fully-connected layer deep neural network, and the structure is shown in fig. 1.
The specific method is as follows: firstly, the observed value o_i and the action value a_i of each unmanned aerial vehicle are input into a single-layer fully connected deep neural network (1-layer MLP) to obtain the state-action feature value g(o_i, a_i) of each unmanned aerial vehicle;
then, the state-action feature values of all unmanned aerial vehicles are input into the attention unit.
In the attention unit, the attention weight α_j of each other unmanned aerial vehicle j (j ≠ i) is calculated from the feature value of unmanned aerial vehicle i and the feature values of the other unmanned aerial vehicles j as follows:

$$\alpha_j = \mathrm{softmax}\!\left(\frac{\big(W_k\, g(o_j, a_j)\big)^{\mathsf T}\, W_q\, g(o_i, a_i)}{\sqrt{d}}\right)$$

wherein W_k and W_q are learnable attention parameter matrices and d is the dimension of the state-action feature values. The formula obtains an attention coefficient by taking the scaled dot product of the state-action feature values with the attention parameter matrices, and the attention weight of unmanned aerial vehicle j is then obtained by normalizing the attention coefficients with a softmax function.
Then, the influence value e_i of the other unmanned aerial vehicles on unmanned aerial vehicle i is calculated as a weighted sum of their feature values using the attention weights:

$$e_i = \sum_{j \ne i} \alpha_j\, h\!\big(W_o\, g(o_j, a_j)\big)$$

wherein W_o is a learnable attention parameter matrix and h is a dot-product operation.
Finally, the state-action feature value g(o_i, a_i) of unmanned aerial vehicle i and the influence value e_i are input into a two-layer fully connected deep neural network (2-layer MLP) to obtain the action-state value Q_i of the unmanned aerial vehicle.
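A minimal sketch, assuming PyTorch, of the attention-based critic described above: a single-layer MLP produces g(o_i, a_i), a scaled-dot-product attention unit with learnable matrices W_q, W_k, W_o yields the influence value e_i, and a two-layer head outputs Q_i. The layer sizes and the use of a single attention head are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionCritic(nn.Module):
    def __init__(self, obs_dim, act_dim, n_agents, d_model=64):
        super().__init__()
        self.g = nn.Linear(obs_dim + act_dim, d_model)            # 1-layer MLP: g(o_i, a_i)
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_o = nn.Linear(d_model, d_model, bias=False)
        self.q_head = nn.Sequential(                              # 2-layer MLP -> Q_i
            nn.Linear(2 * d_model, d_model), nn.ReLU(), nn.Linear(d_model, 1))
        self.n_agents = n_agents
        self.scale = d_model ** 0.5

    def forward(self, obs, act, i):
        # obs: (batch, N, obs_dim), act: (batch, N, act_dim); returns Q_i: (batch, 1)
        g = F.relu(self.g(torch.cat([obs, act], dim=-1)))         # (batch, N, d_model)
        q = self.W_q(g[:, i])                                     # query from agent i
        k, v = self.W_k(g), self.W_o(g)                           # keys/values from all agents
        logits = torch.einsum('bd,bjd->bj', q, k) / self.scale    # scaled dot product
        mask = torch.zeros(self.n_agents, dtype=torch.bool, device=obs.device)
        mask[i] = True                                            # exclude agent i itself
        alpha = F.softmax(logits.masked_fill(mask, float('-inf')), dim=-1)   # weights alpha_j
        e_i = torch.einsum('bj,bjd->bd', alpha, v)                # weighted influence value e_i
        return self.q_head(torch.cat([g[:, i], e_i], dim=-1))
```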
The training process of the deep neural network in this embodiment is as follows: first, a parameterized simulation environment model is built with python and the total system state S is initialized; according to the total system state in the current time slot and the observation range of each unmanned aerial vehicle, the observation values X = [o_1, o_2, …, o_M] of the unmanned aerial vehicles in the current time slot are generated. The observed value o_i of each unmanned aerial vehicle is input into the corresponding actor network to obtain the action value a_i of each unmanned aerial vehicle. According to the total state of the current system and the action values A = [a_1, a_2, …, a_M] of all unmanned aerial vehicles in the current time slot, the simulation environment computes the rewards R = [r_1, r_2, …, r_M] of all unmanned aerial vehicles, the system state S′ of the next time slot and the observation values X′ = [o′_1, o′_2, …, o′_M]. The tuple (X, A, R, X′) of a time slot is stored as an experience sample in the agents' experience pool for updating the network parameters. The network parameters are updated after the number of samples in the experience pool reaches a certain threshold; taking the update of the network parameters of unmanned aerial vehicle i as an example, the update steps for the other unmanned aerial vehicles are the same.
The process of updating the estimated critic network parameters is as follows: H experience samples (X_j, A_j, R_j, X′_j), j ∈ {1, 2, …, H}, are taken randomly from the experience pool, and the next-slot observations X′_j in each sample j are input into the target actor networks of the corresponding agents to obtain the actions of all agents in the next time slot of the experience sample, A′_j = [a′_1, a′_2, …, a′_M]. The next-slot observation X′_j and action values A′_j in sample j are input into the target critic network to obtain the target Q value of agent i,

$$y_j^i = r_j^i + \gamma\, Q_i'\big(X_j', a_1', a_2', \ldots, a_M'\big)$$

and the Q value of agent i, Q_i(X_j, a_1, a_2, …, a_M), is obtained from the estimated critic network, whose inputs are the observed values and action values of the current time slot in sample j.
The above steps are repeated for all agents and the mean-square-error loss function of the estimated critic network is calculated according to the following formula; the smaller the loss value, the more accurate the evaluation obtained by the critic network, wherein γ is the discount factor of the reward and r_j^i is the reward value of agent i in the j-th experience sample:

$$L(\theta_i^{Q}) = \frac{1}{H} \sum_{j=1}^{H} \Big(y_j^i - Q_i(X_j, a_1, a_2, \ldots, a_M)\Big)^2$$

The parameters θ^Q of the critic network are then updated by minimizing the loss function with a stochastic gradient descent method.
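A minimal sketch of the estimated-critic update described above, reusing the illustrative PyTorch interfaces from the earlier sketches (batched per-agent observation and action tensors); the batch layout is an assumption made for illustration.

```python
import torch
import torch.nn.functional as F

def update_critic(critic, target_critic, target_actors, optimizer, batch, i, gamma=0.95):
    X, A, R, X_next = batch                                   # (H, N, obs_dim), (H, N, act_dim), (H, N), ...
    with torch.no_grad():
        A_next = torch.stack([mu(X_next[:, j]) for j, mu in enumerate(target_actors)], dim=1)
        y = R[:, i:i + 1] + gamma * target_critic(X_next, A_next, i)   # target Q value y_j^i
    q = critic(X, A, i)                                       # estimated Q value
    loss = F.mse_loss(q, y)                                   # mean-square-error loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                          # stochastic gradient descent step
    return loss.item()
```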
The process of updating the estimated actor network parameters is as follows: for each agent i, the observed values and action values of the current time slot in the H sampled experience samples are input into the estimated critic network to obtain the Q value Q_i(X, a_1, a_2, …, a_M). The objective of the actor network is to maximize the Q value, and its performance function is expressed as the expected value of the Q value:

$$J(\mu_i) = \mathbb{E}_{X, a \sim D}\big[\,Q_i(X, a_1, a_2, \ldots, a_M)\,\big]_{a_i = \mu_i(o_i)}$$

wherein E_{X,a∼D} denotes the expected value of the Q value calculated with the sampled experience, and μ_i(o_i) is the policy function approximated by the estimated actor network of unmanned aerial vehicle i. When updating the estimated actor network parameters θ_i^{μ}, the performance function is differentiated with respect to θ_i^{μ},

$$\nabla_{\theta_i^{\mu}} J \approx \mathbb{E}_{X, a \sim D}\Big[\nabla_{\theta_i^{\mu}} \mu_i(o_i)\, \nabla_{a_i} Q_i(X, a_1, \ldots, a_M)\big|_{a_i = \mu_i(o_i)}\Big]$$

and the parameters are updated with a stochastic gradient ascent method.
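A minimal sketch of the estimated-actor update described above: agent i's sampled action is replaced by the output of its current policy and the expected Q value is maximized by gradient ascent. The interfaces are the same illustrative ones used in the previous sketches.

```python
import torch

def update_actor(actor, critic, optimizer, X, A, i):
    # replace agent i's stored action with the action its current policy outputs
    actions = [A[:, j] for j in range(A.shape[1])]
    actions[i] = actor(X[:, i])
    A_new = torch.stack(actions, dim=1)
    performance = critic(X, A_new, i).mean()     # J = E[Q_i(X, a_1, ..., a_M)]
    optimizer.zero_grad()
    (-performance).backward()                    # gradient ascent on J via descent on -J
    optimizer.step()
    return performance.item()
```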
The process of updating the target network parameters is as follows: finally, the parameters θ^{Q'} of the target critic network and the parameters θ_i^{μ'} of the target actor networks of all agents i (i ∈ {1, 2, …, M}) are soft-updated according to the following formulas, wherein κ is the learning rate of the target networks:

$$\theta^{Q'} \leftarrow \kappa\, \theta^{Q} + (1 - \kappa)\, \theta^{Q'}$$

$$\theta_i^{\mu'} \leftarrow \kappa\, \theta_i^{\mu} + (1 - \kappa)\, \theta_i^{\mu'}$$

The above training operations are repeated for multiple rounds until the policy function approximated by the estimated actor network converges.
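A minimal sketch of the soft (Polyak-style) update of the target network parameters described above; kappa plays the role of the target-network learning rate in the formulas.

```python
import torch

@torch.no_grad()
def soft_update(target_net, source_net, kappa=0.01):
    for t, s in zip(target_net.parameters(), source_net.parameters()):
        t.mul_(1.0 - kappa).add_(kappa * s)      # theta' <- kappa*theta + (1 - kappa)*theta'
```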
The pseudo code for the deep neural network training is given as an image in the original specification.
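Since that pseudo code is only available as an image, the loop below is a hedged, high-level reconstruction of the training procedure from the surrounding text; it reuses the illustrative helper functions sketched earlier (collect_one_slot, sample_batch, update_critic, update_actor, soft_update) and is not the patent's original pseudo code.

```python
import torch

def collate(samples):
    # stack a list of (X, A, R, X') samples (each entry already stored as a tensor) into batches
    Xs, As, Rs, Xns = zip(*samples)
    return torch.stack(Xs), torch.stack(As), torch.stack(Rs), torch.stack(Xns)

def train(env, actors, critics, target_actors, target_critics,
          actor_opts, critic_opts, steps=100_000, warmup=5_000, batch_size=256):
    state = env.reset()
    for _ in range(steps):
        state = collect_one_slot(env, actors, state)          # steps S202-S204
        if len(replay_buffer) < warmup:
            continue                                          # wait until the pool reaches the threshold
        batch = collate(sample_batch(batch_size))
        X, A, _, _ = batch
        for i in range(len(actors)):                          # update every agent in turn
            update_critic(critics[i], target_critics[i], target_actors,
                          critic_opts[i], batch, i)
            update_actor(actors[i], critics[i], actor_opts[i], X, A, i)
            soft_update(target_critics[i], critics[i])
            soft_update(target_actors[i], actors[i])
```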
after the deep neural network is trained for multiple times, namely after the deep neural network is trained, the estimation operator network can be used for controlling the unmanned aerial vehicle group in the actual environment to complete the information acquisition task, and the method specifically comprises the following steps:
firstly, parameterizing the state of an unmanned aerial vehicle system in an actual environment and the observation of each unmanned aerial vehicle; then, inputting the parameterized observation value of each task period of the unmanned aerial vehicle into a trained operator network to obtain the action value of the task period of the unmanned aerial vehicle; and finally, converting the obtained action value into a flight instruction and a calculation migration instruction, enabling the unmanned aerial vehicle to execute flight action according to the instruction and fly to the target position, acquiring all information tasks in the range, and calculating and migrating the acquired original data according to the migration rate to complete the calculation task. The above steps are repeated at each task cycle.
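A minimal sketch of this deployment phase, with hypothetical drone-side methods (parameterized_observation, fly, collect_tasks_in_range, offload) used purely for illustration; decode_action is the illustrative helper sketched earlier.

```python
def run_task_cycle(drones, actors, d_max, num_servers):
    for drone, actor in zip(drones, actors):
        o_i = drone.parameterized_observation()              # local observation only, no inter-drone comms
        raw = actor.act(o_i)                                  # trained actor network (act() wraps a forward pass)
        d_i, theta_i, rho_i, z_i = decode_action(raw, d_max, num_servers)
        drone.fly(distance=d_i, angle=theta_i)                # flight command
        data = drone.collect_tasks_in_range()                 # acquisition sub-task
        drone.offload(data, ratio=rho_i, server=z_i)          # computation migration command
```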
The following is another embodiment of the method for controlling an unmanned aerial vehicle based on multi-agent deep reinforcement learning, which includes:
s1: building a simulation environment by using the parameterized multi-unmanned aerial vehicle information acquisition task model, and randomly initializing a system state and neural network parameters;
s2: acquiring an observed value of the current time slot of the unmanned aerial vehicle;
s3: determining the information acquisition and the calculation migration action of the unmanned aerial vehicle in the current time slot by adopting an actor network according to the observed value;
s4: calculating the reward of the current time slot and the observation value of the unmanned aerial vehicle of the next time slot according to the parameterized model, and storing the time slot experience sample into an experience pool; repeating S2-S4 until the number of experience pool samples reaches a certain threshold;
s5: randomly sampling a certain number of experience samples from the experience pool, updating the neural network parameters to obtain updated network parameters, and repeating S2-S5 until the strategy function is converged;
s6: parameterizing an information acquisition task in an actual environment, and acquiring an actual observation value of the unmanned aerial vehicle;
s7: and determining the action value of the unmanned aerial vehicle by adopting an actor network according to the observed value, and controlling the unmanned aerial vehicle to carry out information acquisition and calculation migration according to the action value.
Various methods may be used in the embodiment, including but not limited to:
(1) modifying the parameter update formulas of the deep neural network, adopting the update schemes of deep reinforcement learning algorithms such as PPO or SAC;
(2) modifying the computing method of the reward function R in the model.
On the basis of the technical scheme of the invention, the improvement and equivalent transformation of individual steps of the algorithm according to the principle of the invention are not excluded from the protection scope of the invention.
The embodiment models the information acquisition task of the unmanned aerial vehicle, decomposes the information acquisition task of the unmanned aerial vehicle into an acquisition subtask and a calculation subtask, and parameterizes the whole unmanned aerial vehicle system model. In the embodiment, each unmanned aerial vehicle is regarded as an intelligent agent, and the multi-intelligent-agent deep reinforcement learning only needs to acquire the reward value through interaction between each intelligent agent and the environment, so that the strategy of the intelligent agent is continuously learned and improved, and the global state information of the system is not required to be acquired through communication when the intelligent agent makes a decision, so that the communication delay is avoided.
In the embodiment, when the actor network is trained, an attention mechanism is combined, and the critic network with the attention unit is used for evaluating the performance of the actor network, so that the influence of other agents with higher similarity on the actor network can be better noticed during evaluation, more accurate evaluation values are obtained to guide the training of the actor network, and the training speed of the actor network is accelerated.
Compared with an ordinary single-agent deep reinforcement learning algorithm that treats the whole unmanned aerial vehicle system as one agent, this embodiment regards every unmanned aerial vehicle as an agent; when executing the information acquisition task, each unmanned aerial vehicle only needs to input its own observed value directly into the trained actor network to obtain the control instruction for the task period, which avoids the need of the single-agent deep reinforcement learning algorithm to collect the states and observed values of all unmanned aerial vehicles through communication before formulating control instructions, thereby reducing the reaction delay.
The invention also provides an embodiment of the unmanned aerial vehicle control system based on multi-agent deep reinforcement learning, which comprises the following steps:
the task model establishing module is used for establishing an information acquisition task model according to parameters of the unmanned aerial vehicle group information acquisition system; the information acquisition task is divided into an acquisition subtask and a calculation subtask;
the deep neural network building and training module is used for building a deep neural network model according to the task model and training the deep neural network model with a multi-agent deep reinforcement learning algorithm combined with an attention mechanism; wherein the agent is an unmanned aerial vehicle;
and the information acquisition task execution module is used for controlling the unmanned aerial vehicle group in the actual environment to complete the information acquisition task by utilizing the trained deep neural network model.
The present invention also provides a computer-readable storage medium for storing a computer program; wherein the computer program, when executed by a processor, implements the multi-agent deep reinforcement learning-based drone control method.
In addition, the present invention also provides an electronic device including:
a memory for storing a computer program;
and the processor is used for executing the computer program to realize the unmanned aerial vehicle control method based on multi-agent deep reinforcement learning.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (6)

1. An unmanned aerial vehicle control method based on multi-agent deep reinforcement learning is characterized by comprising the following steps:
establishing an information acquisition task model according to parameters of the unmanned aerial vehicle group information acquisition system; the information acquisition task is divided into an acquisition subtask and a calculation subtask;
converting parameters of the unmanned aerial vehicle group information acquisition system into a state space of the system and an action space of the intelligent agent, and setting an instant reward function;
constructing a deep neural network model according to the task model, and training the deep neural network model by utilizing a multi-agent deep reinforcement learning algorithm combined with an attention mechanism; the deep neural network model comprises an actor network and a critic network, the actor network comprises an estimated actor network and a target actor network, the critic network comprises an estimated critic network and a target critic network, an attention unit is embedded in the critic network on top of its three fully connected layers, and the agent is an unmanned aerial vehicle;
when an actor network is trained, a critic network with an attention unit is used for evaluating the performance of the actor network, and the specific process is as follows:
firstly, the number of unmanned aerial vehicles in the unmanned aerial vehicle cluster is N, and the observed value o_i and the action value a_i of unmanned aerial vehicle i are input into a single fully connected layer to obtain the state-action feature value g(o_i, a_i) of each unmanned aerial vehicle, and the state-action feature values of all unmanned aerial vehicles are input into the attention unit;
the attention unit calculates the attention weight α_j of unmanned aerial vehicle j according to the feature value of unmanned aerial vehicle i and the feature values of the other unmanned aerial vehicles j:

$$\alpha_j = \mathrm{softmax}\!\left(\frac{\big(W_k\, g(o_j, a_j)\big)^{\mathsf T}\, W_q\, g(o_i, a_i)}{\sqrt{d}}\right)$$

wherein W_k and W_q are learnable attention parameter matrices, d is the dimension of the state-action feature values, 1 ≤ i ≤ N, and j ≠ i;
calculating the influence value e_i of the other unmanned aerial vehicles on unmanned aerial vehicle i in a weighted-sum manner according to the attention weights and the feature values of the other unmanned aerial vehicles:

$$e_i = \sum_{j \ne i} \alpha_j\, h\!\big(W_o\, g(o_j, a_j)\big)$$

wherein W_o is a learnable attention parameter matrix and h is a dot-product operation;
inputting the state-action feature value g(o_i, a_i) of unmanned aerial vehicle i and the influence value e_i into a two-layer fully connected network to obtain the action-state value Q_i of the unmanned aerial vehicle;
And controlling the unmanned aerial vehicle group in the actual environment to complete an information acquisition task by using the trained deep neural network model.
2. The method for controlling the multi-agent deep reinforcement learning-based unmanned aerial vehicle according to claim 1, wherein training the deep neural network model by using a multi-agent deep reinforcement learning algorithm combined with an attention mechanism specifically comprises:
S201: randomly initializing the system state and the neural network parameters;
S202: acquiring, according to the system state and the observation range of the unmanned aerial vehicles, the observation values X = [o_1, o_2, ..., o_M] of the current time slot, wherein M is the number of unmanned aerial vehicles in the unmanned aerial vehicle cluster;
S203: inputting the observed value o_i of each unmanned aerial vehicle into its corresponding actor network to obtain the corresponding action value a_i, where 1 ≤ i ≤ M;
S204: according to the system state and the action values A = [a_1, a_2, ..., a_M] of all unmanned aerial vehicles in the current time slot, obtaining the rewards R = [r_1, r_2, ..., r_M] of all unmanned aerial vehicles, the system state S' of the next time slot and the observed values X' = [o'_1, o'_2, ..., o'_M], and storing the experience sample (X, A, R, X') in the experience pool of the intelligent agents;
S205: repeating S202-S204 until the number of samples in the experience pool reaches a set threshold, then extracting a batch of experience samples from the experience pool to update the neural network parameters, until the policy function of the actor network converges (see the training-loop sketch following this claim).
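Steps S201-S205 describe a standard centralised-training loop over an experience pool. The sketch below restates them under stated assumptions: env, actor.act, critic.update and actor.update are placeholder interfaces introduced for illustration, not interfaces defined by the patent.

```python
import random
from collections import deque

def train(env, actors, critics, episodes=1000, batch_size=256, buffer_size=100_000):
    """Sketch of the S201-S205 loop: gather (X, A, R, X') samples, then update."""
    replay = deque(maxlen=buffer_size)                  # experience pool (S204)
    for _ in range(episodes):
        X = env.reset()                                 # S201/S202: random state, observations o_1..o_M
        done = False
        while not done:
            A = [actor.act(o) for actor, o in zip(actors, X)]  # S203: per-drone action values a_i
            X_next, R, done = env.step(A)               # S204: rewards r_i and next observations
            replay.append((X, A, R, X_next))            # store the experience sample
            X = X_next
            if len(replay) >= batch_size:               # S205: update once the pool is large enough
                batch = random.sample(replay, batch_size)
                for i, (actor, critic) in enumerate(zip(actors, critics)):
                    critic.update(batch, agent_index=i)         # TD update against the target networks
                    actor.update(batch, critic, agent_index=i)  # policy gradient through the critic
    return actors
```

The claim leaves the stopping criterion as convergence of the actor's policy function; a fixed episode budget stands in for that here.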
3. The multi-agent deep reinforcement learning-based unmanned aerial vehicle control method according to claim 1, wherein the step of controlling the unmanned aerial vehicle cluster in the actual environment by using the trained deep neural network model to complete the information acquisition task specifically comprises:
parameterizing the state of the unmanned aerial vehicle system and the observed value of each unmanned aerial vehicle in the actual environment;
inputting each unmanned aerial vehicle's parameterized observation value into the trained actor network to obtain the action value of the unmanned aerial vehicle;
and converting the action value into an acquisition instruction and a computation instruction, according to which the unmanned aerial vehicle acquires information, performs computation and offloads data (see the deployment sketch following this claim).
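For deployment as in claim 3, one minimal loop could look as follows; observe, to_instructions, and the drone API are illustrative names only, since the claim does not fix them, and the even split of the action value is likewise an assumption.

```python
from typing import Sequence, Tuple

def to_instructions(action: Sequence[float]) -> Tuple[Sequence[float], Sequence[float]]:
    """Illustrative split of an action value into an acquisition instruction and a
    computation instruction; the real mapping follows the patent's action space,
    which claim 3 does not spell out."""
    half = len(action) // 2
    return action[:half], action[half:]

def run_mission(env, actors):
    """Sketch of claim 3: drive the trained actor networks in the real environment."""
    while not env.mission_complete():
        X = env.observe()                                  # parameterized observation per drone
        for drone, actor, o in zip(env.drones, actors, X):
            a = actor.act(o)                               # action value from the trained actor network
            collect_cmd, compute_cmd = to_instructions(a)  # acquisition vs. computation instruction
            drone.execute(collect_cmd, compute_cmd)        # collect, compute, and offload accordingly
```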
4. An unmanned aerial vehicle control system based on multi-agent deep reinforcement learning, characterized by comprising:
the task model establishing module is used for establishing an information acquisition task model according to parameters of the unmanned aerial vehicle group information acquisition system; the information acquisition task is divided into an acquisition subtask and a calculation subtask;
the system parameter conversion module is used for converting the parameters of the unmanned aerial vehicle group information acquisition system into a state space of the system and an action space of the intelligent agent and setting an instant reward function;
the deep neural network building and training module is used for building a deep neural network model according to the task model and training the deep neural network model by using a multi-agent deep reinforcement learning algorithm combined with an attention mechanism; the deep neural network model comprises an actor network and a critic network, the actor network comprises an estimated-value actor network and a target actor network, the critic network comprises an estimated-value critic network and a target critic network, an attention unit is embedded among the three fully connected layers of the critic network, and the intelligent agent is an unmanned aerial vehicle;
when an actor network is trained, a critic network with an attention unit is used for evaluating the performance of the actor network, and the specific process is as follows:
firstly, given that the number of unmanned aerial vehicles in the cluster is N, the observed value o_i and the action value a_i of unmanned aerial vehicle i are input into a single fully connected layer to obtain the state-action characteristic value g(o_i, a_i) of each unmanned aerial vehicle, and the state-action characteristic values of all unmanned aerial vehicles are input into the attention unit;
the attention unit calculates the attention weight α_j of each other unmanned aerial vehicle j according to the characteristic value of unmanned aerial vehicle i and the characteristic values of the other unmanned aerial vehicles j:
[attention-weight formula, rendered as image FDA0003374387210000031 in the original publication]
where
[auxiliary definition, rendered as image FDA0003374387210000032 in the original publication]
and W_q is a parameter of the attention unit, with 1 ≤ i ≤ N and j ≠ i;
the influence value e_i of the other unmanned aerial vehicles on unmanned aerial vehicle i is calculated as a weighted sum of the attention weights and the other unmanned aerial vehicles' characteristic values:
[weighted-sum formula for e_i, rendered as image FDA0003374387210000033 in the original publication]
the state-action characteristic value g(o_i, a_i) of unmanned aerial vehicle i and the influence value e_i are input into a two-layer fully connected network to obtain the state-action value Q_i of the unmanned aerial vehicle;
And the information acquisition task execution module is used for controlling the unmanned aerial vehicle group in the actual environment to complete the information acquisition task by utilizing the trained deep neural network model.
5. A computer-readable storage medium for storing a computer program, wherein the computer program, when executed by a processor, implements the multi-agent deep reinforcement learning-based unmanned aerial vehicle control method according to any one of claims 1 to 3.
6. An electronic device, comprising:
a memory for storing a computer program;
a processor for executing the computer program to implement the multi-agent deep reinforcement learning-based unmanned aerial vehicle control method according to any one of claims 1 to 3.
CN202011118496.4A 2020-10-19 2020-10-19 Unmanned aerial vehicle control method and system based on multi-agent deep reinforcement learning Active CN112256056B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011118496.4A CN112256056B (en) 2020-10-19 2020-10-19 Unmanned aerial vehicle control method and system based on multi-agent deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN112256056A CN112256056A (en) 2021-01-22
CN112256056B true CN112256056B (en) 2022-03-01

Family

ID=74244867

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011118496.4A Active CN112256056B (en) 2020-10-19 2020-10-19 Unmanned aerial vehicle control method and system based on multi-agent deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN112256056B (en)

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112966431B (en) * 2021-02-04 2023-04-28 西安交通大学 Data center energy consumption joint optimization method, system, medium and equipment
CN113128698B (en) * 2021-03-12 2022-09-20 合肥工业大学 Reinforced learning method for multi-unmanned aerial vehicle cooperative confrontation decision
CN112947575B (en) * 2021-03-17 2023-05-16 中国人民解放军国防科技大学 Unmanned aerial vehicle cluster multi-target searching method and system based on deep reinforcement learning
CN113146624B (en) * 2021-03-25 2022-04-29 重庆大学 Multi-agent control method based on maximum angle aggregation strategy
CN113269329B (en) * 2021-04-30 2024-03-19 北京控制工程研究所 Multi-agent distributed reinforcement learning method
CN113381824B (en) * 2021-06-08 2023-01-31 清华大学 Underwater acoustic channel measuring method and device, unmanned underwater vehicle and storage medium
CN113572548B (en) * 2021-06-18 2023-07-07 南京理工大学 Unmanned plane network cooperative fast frequency hopping method based on multi-agent reinforcement learning
CN113625757B (en) * 2021-08-12 2023-10-24 中国电子科技集团公司第二十八研究所 Unmanned aerial vehicle group scheduling method based on reinforcement learning and attention mechanism
CN113703482B (en) * 2021-08-30 2022-08-12 西安电子科技大学 Task planning method based on simplified attention network in large-scale unmanned aerial vehicle cluster
CN113900445A (en) * 2021-10-13 2022-01-07 厦门渊亭信息科技有限公司 Unmanned aerial vehicle cooperative control training method and system based on multi-agent reinforcement learning
CN114423061B (en) * 2022-01-20 2024-05-07 重庆邮电大学 Wireless route optimization method based on attention mechanism and deep reinforcement learning
CN115086914B (en) * 2022-05-20 2023-11-10 成都飞机工业(集团)有限责任公司 Remote online reconstruction method for acquisition strategy of airborne test system
CN115507852B (en) * 2022-09-07 2023-11-03 广东工业大学 Multi-unmanned aerial vehicle path planning method based on blockchain and enhanced attention learning
CN115599129A (en) * 2022-11-07 2023-01-13 北京卓翼智能科技有限公司(Cn) Unmanned aerial vehicle cluster system and unmanned aerial vehicle
CN116009590B (en) * 2023-02-01 2023-11-17 中山大学 Unmanned aerial vehicle network distributed track planning method, system, equipment and medium
CN116741019A (en) * 2023-08-11 2023-09-12 成都飞航智云科技有限公司 Flight model training method and training system based on AI

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109884897A (en) * 2019-03-21 2019-06-14 中山大学 A kind of matching of unmanned plane task and computation migration method based on deeply study
CN110956148A (en) * 2019-12-05 2020-04-03 上海舵敏智能科技有限公司 Autonomous obstacle avoidance method and device for unmanned vehicle, electronic device and readable storage medium
CN111144793A (en) * 2020-01-03 2020-05-12 南京邮电大学 Commercial building HVAC control method based on multi-agent deep reinforcement learning
CN111786713A (en) * 2020-06-04 2020-10-16 大连理工大学 Unmanned aerial vehicle network hovering position optimization method based on multi-agent deep reinforcement learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"基于强化学习的多无人机协同任务规划算法研究";樊龙涛;《中国优秀硕士学位论文全文数据库 工程科技Ⅱ辑》;20191115;正文第17-21页 *

Also Published As

Publication number Publication date
CN112256056A (en) 2021-01-22

Similar Documents

Publication Publication Date Title
CN112256056B (en) Unmanned aerial vehicle control method and system based on multi-agent deep reinforcement learning
CN111667513B (en) Unmanned aerial vehicle maneuvering target tracking method based on DDPG transfer learning
CN107479368B (en) Method and system for training unmanned aerial vehicle control model based on artificial intelligence
CN110673620B (en) Four-rotor unmanned aerial vehicle air line following control method based on deep reinforcement learning
CN110806759B (en) Aircraft route tracking method based on deep reinforcement learning
CN109520504B (en) Grid discretization-based unmanned aerial vehicle patrol route optimization method
CN109884897B (en) Unmanned aerial vehicle task matching and calculation migration method based on deep reinforcement learning
CN112465151A (en) Multi-agent federal cooperation method based on deep reinforcement learning
CN112711271B (en) Autonomous navigation unmanned aerial vehicle power optimization method based on deep reinforcement learning
Hong et al. Energy-efficient online path planning of multiple drones using reinforcement learning
CN110587606B (en) Open scene-oriented multi-robot autonomous collaborative search and rescue method
CN110531786B (en) Unmanned aerial vehicle maneuvering strategy autonomous generation method based on DQN
CN113110546B (en) Unmanned aerial vehicle autonomous flight control method based on offline reinforcement learning
CN107092987B (en) Method for predicting autonomous landing wind speed of small and medium-sized unmanned aerial vehicles
CN112580537B (en) Deep reinforcement learning method for multi-unmanned aerial vehicle system to continuously cover specific area
CN112651437A (en) Spatial non-cooperative target pose estimation method based on deep learning
CN112766499A (en) Method for realizing autonomous flight of unmanned aerial vehicle through reinforcement learning technology
CN113268074B (en) Unmanned aerial vehicle flight path planning method based on joint optimization
Yue et al. Deep reinforcement learning and its application in autonomous fitting optimization for attack areas of UCAVs
CN114967721A (en) Unmanned aerial vehicle self-service path planning and obstacle avoidance strategy method based on DQ-CapsNet
CN114679729A (en) Radar communication integrated unmanned aerial vehicle cooperative multi-target detection method
Puente-Castro et al. Q-learning based system for path planning with unmanned aerial vehicles swarms in obstacle environments
CN114371729B (en) Unmanned aerial vehicle air combat maneuver decision method based on distance-first experience playback
Feng et al. Infrared camera assisted uav autonomous control via deep reinforcement learning
CN116400726A (en) Rotor unmanned aerial vehicle escape method and system based on reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant