CN117933622A - Unmanned aerial vehicle dynamic task allocation method and device based on hierarchical reinforcement learning - Google Patents

Unmanned aerial vehicle dynamic task allocation method and device based on hierarchical reinforcement learning

Info

Publication number
CN117933622A
Authority
CN
China
Prior art keywords
unmanned aerial
task
aerial vehicle
function
task allocation
Prior art date
Legal status
Pending
Application number
CN202410037582.4A
Other languages
Chinese (zh)
Inventor
董琦
王若男
刘欣雨
尚晓舟
Current Assignee
Electronic Science Research Institute Of China Electronics Technology Group Co ltd
Original Assignee
Electronic Science Research Institute Of China Electronics Technology Group Co ltd
Priority date
Filing date
Publication date
Application filed by Electronic Science Research Institute Of China Electronics Technology Group Co ltd filed Critical Electronic Science Research Institute Of China Electronics Technology Group Co ltd
Priority to CN202410037582.4A priority Critical patent/CN117933622A/en
Publication of CN117933622A publication Critical patent/CN117933622A/en
Pending legal-status Critical Current

Landscapes

  • Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)

Abstract

The invention provides an unmanned aerial vehicle dynamic task allocation method and device based on hierarchical reinforcement learning, wherein the method comprises the following steps: modeling a task allocation scenario; every preset number of time steps, collecting the observation information of each executor, performing global task allocation with the configured coordinator algorithm, and transmitting the allocation results to the executors, so as to maximize the cumulative reward function of the upper-layer coordinator; using the configured executor algorithm to enable each executor to reach its task node in the fewest time steps by taking actions to complete the task corresponding to the allocation result, so as to maximize the cumulative reward function of the lower-layer executor; and iterating the coordinator cumulative reward function and the executor cumulative reward function so that the upper-layer function and the lower-layer function each reach their cumulative maximum. The invention solves the problem with a reinforcement learning algorithm, which improves the overall task completion degree of the system under time constraints and addresses the dimension explosion caused by large-scale task allocation.

Description

Unmanned aerial vehicle dynamic task allocation method and device based on hierarchical reinforcement learning
Technical Field
The invention relates to the technical field of unmanned aerial vehicles, and in particular to an unmanned aerial vehicle dynamic task allocation method and device based on hierarchical reinforcement learning.
Background
Unmanned aerial vehicle technology has been widely applied in many fields thanks to advantages such as high flexibility and high efficiency. A single unmanned aerial vehicle has difficulty completing long-duration, large-scale tasks, so cooperative task allocation for unmanned aerial vehicle clusters has become mainstream. Cooperative task allocation for unmanned aerial vehicle clusters is widely used and is generally divided into civil and military application fields: in the civil field, unmanned aerial vehicle clusters can be applied to scenarios such as express logistics, emergency rescue, and agricultural irrigation; in the military field, they can be used for various cluster combat and reconnaissance tasks. As unmanned aerial vehicle application scenarios become increasingly complex, existing multi-UAV task allocation algorithms struggle to meet practical demands: current unmanned aerial vehicle dynamic task allocation algorithms mainly focus on the energy consumption and path planning problems that arise during allocation, while ignoring the low task completion degree caused by time constraints and the dimension explosion caused by growing task numbers under highly dynamic task allocation conditions.
Disclosure of Invention
The technical problems the invention aims to solve are the inaccurate task allocation and the dimension explosion of existing unmanned aerial vehicle dynamic task allocation algorithms; in view of this, the present invention provides an unmanned aerial vehicle dynamic task allocation method and device based on hierarchical reinforcement learning.
The technical scheme adopted by the invention is an unmanned aerial vehicle dynamic task allocation method based on hierarchical reinforcement learning, comprising the following steps:
S1, modeling a task allocation scenario, wherein a control end serves as the coordinator of an upper layer and the unmanned aerial vehicles serve as the executors of a lower layer;
S2, every preset number of time steps, collecting the observation information of each executor, performing global task allocation with the configured coordinator algorithm, and transmitting the allocation results to the executors, so as to maximize the cumulative reward function of the upper-layer coordinator;
S3, using the configured executor algorithm, enabling each executor to reach its task node in the fewest time steps by taking actions to complete the task corresponding to the allocation result, so as to maximize the cumulative reward function of the lower-layer executor;
S4, iterating the coordinator cumulative reward function and the executor cumulative reward function so that the upper-layer function and the lower-layer function each reach their cumulative maximum.
In one embodiment, modeling the task allocation scenario includes:
initializing M unmanned aerial vehicles and N task nodes, wherein each task has a start time and an end time, each task is assigned to an unmanned aerial vehicle as a target, and the unmanned aerial vehicles follow preset time constraints.
In one embodiment, the step S2 includes:
training with an actor-critic algorithm to determine the vector corresponding to the task allocation result.
In one embodiment, the step S3 includes:
training with deep reinforcement learning, wherein at each time step the executor selects one action to perform, the actions comprising: moving in the up, down, left, and right directions.
The invention also provides an unmanned aerial vehicle dynamic task allocation device based on hierarchical reinforcement learning, comprising:
a modeling unit configured to model a task allocation scenario, wherein a control end serves as the coordinator of an upper layer and the unmanned aerial vehicles serve as the executors of a lower layer;
an upper-layer unit configured to collect, every preset number of time steps, the observation information of each executor, perform global task allocation with the configured coordinator algorithm, and transmit the allocation results to the executors, so as to maximize the cumulative reward function of the upper-layer coordinator;
a lower-layer unit configured to use the configured executor algorithm to enable each executor to reach its task node in the fewest time steps by taking actions to complete the task corresponding to the allocation result, so as to maximize the cumulative reward function of the lower-layer executor;
and an iteration unit configured to iterate the coordinator cumulative reward function and the executor cumulative reward function so that the upper-layer function and the lower-layer function each reach their cumulative maximum.
In one embodiment, the modeling unit is further configured to:
initialize M unmanned aerial vehicles and N task nodes, wherein each task has a start time and an end time, each task is assigned to an unmanned aerial vehicle as a target, and the unmanned aerial vehicles follow preset time constraints.
In one embodiment, the upper-layer unit is further configured to:
train with an actor-critic algorithm to determine the vector corresponding to the task allocation result.
In one embodiment, the lower-layer unit is further configured to:
train with deep reinforcement learning, wherein at each time step the executor selects one action to perform, the actions comprising: moving in the up, down, left, and right directions.
Another aspect of the present invention also provides an electronic device, including: a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the computer program, when executed by the processor, performs the steps of the unmanned aerial vehicle dynamic task allocation method based on hierarchical reinforcement learning as set forth in any of the above.
Another aspect of the present invention also provides a computer storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the unmanned aerial vehicle dynamic task allocation method based on hierarchical reinforcement learning as described in any of the above.
By adopting the technical scheme, the invention has at least the following advantages:
The invention introduces a hierarchical reinforcement learning algorithm into a multi-agent reinforcement learning system and divides the dynamic task allocation problem into two layers of sub-problems on different time scales: the upper layer is a central decision point that acts as the coordinator for global task planning, and the lower layer is the unmanned aerial vehicle swarm that acts as the executors for communication, perception, and route planning. First, the central decision point collects the task information perceived by the unmanned aerial vehicles and quantifies it as the basis for dynamic decisions; the central decision point then produces global task allocation information and transmits it to the unmanned aerial vehicles; finally, each unmanned aerial vehicle autonomously plans its route according to the task allocation information so as to reach its target task node, collecting new task information along the way.
On this basis, the time constraints of dynamic task allocation are fully considered and the problem is solved with a reinforcement learning algorithm, which improves the overall task completion degree of the system under time constraints, addresses the dimension explosion caused by large-scale task allocation, and improves the autonomy of the unmanned aerial vehicles.
Drawings
FIG. 1 is a schematic flow diagram of a method for unmanned aerial vehicle dynamic task allocation based on hierarchical reinforcement learning according to an embodiment of the present invention;
FIG. 2 is a logic diagram of a hierarchical reinforcement learning-based unmanned aerial vehicle dynamic task allocation method according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of the task execution times of an unmanned aerial vehicle according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a multi-agent hierarchical reinforcement learning algorithm (upper level unit algorithm) framework according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a three-dimensional exemplary scenario for task allocation of a build drone according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of the composition of a hierarchical reinforcement learning-based unmanned aerial vehicle dynamic task allocation device according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to further describe the technical means and effects adopted by the present invention for achieving the intended purpose, the following detailed description of the present invention is given with reference to the accompanying drawings and preferred embodiments.
In the drawings, the thickness, size and shape of the object have been slightly exaggerated for convenience of explanation. The figures are merely examples and are not drawn to scale.
It will be further understood that the terms "comprises," "comprising," "includes," "including," "having," and/or "containing," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. Furthermore, when a statement such as "at least one of the following" appears after a list of features, it modifies the entire list rather than an individual element in the list. Furthermore, when describing embodiments of the present application, the use of "may" means "one or more embodiments of the present application." Also, the term "exemplary" is intended to refer to an example or illustration.
As used herein, the terms "substantially," "about," and the like are used as terms of approximation rather than as terms of degree, and are intended to account for the inherent deviations in measured or calculated values that would be recognized by one of ordinary skill in the art.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
It should be noted that, without conflict, the embodiments of the present application and features of the embodiments may be combined with each other. The application will be described in detail below with reference to the drawings in connection with embodiments.
In a first embodiment of the present invention, as shown in fig. 1, a method for unmanned aerial vehicle dynamic task allocation based on hierarchical reinforcement learning includes the following steps:
Step S1, modeling a task allocation scenario, wherein a control end serves as the coordinator of an upper layer and the unmanned aerial vehicles serve as the executors of a lower layer;
Step S2, every preset number of time steps, collecting the observation information of each executor, performing global task allocation with the configured coordinator algorithm, and transmitting the allocation results to the executors, so as to maximize the cumulative reward function of the upper-layer coordinator;
Step S3, using the configured executor algorithm, enabling each executor to reach its task node in the fewest time steps by taking actions to complete the task corresponding to the allocation result, so as to maximize the cumulative reward function of the lower-layer executor;
Step S4, iterating the coordinator cumulative reward function and the executor cumulative reward function so that the upper-layer function and the lower-layer function each reach their cumulative maximum.
The method provided in this embodiment will be described in detail below with reference to fig. 2 to 5.
Step S1, modeling a task allocation scenario, wherein a control end serves as the coordinator of an upper layer and the unmanned aerial vehicles serve as the executors of a lower layer.
Specifically, a task allocation scenario is modeled: M unmanned aerial vehicles and N task nodes are initialized, wherein each task has a start time and an end time, and each task is assigned to an unmanned aerial vehicle as a target. Meanwhile, the unmanned aerial vehicles need to follow certain time constraints.
S11. There is a temporal ordering constraint between tasks, and each task requires a certain execution time. The constraint on the unmanned aerial vehicle is established as follows:

st_j ≥ st_i + dt_i,  i, j = 1, 2, ..., M

where st_i is the start time of task T_i and dt_i is its duration.

As shown in fig. 3, T_i denotes a task; t_start,i denotes the task start time; t_end,i denotes the latest start time of the task, and each task must be started within (t_start,i, t_end,i); t_duration,i denotes the task duration. The same drone can only perform one task at a time, and task T_j must be performed after task T_i is completed.

S12. After the unmanned aerial vehicle finishes processing task T_i and accepts task T_j, it needs to reach the new task execution area, so the following condition must be satisfied:

st_j ≥ st_i + dt_i + dt_ij,  i, j = 1, 2, ..., M

where st_i is the start time of T_i, dt_i is its duration, and dt_ij is the time taken to travel from the original task node to the new task node.
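As a minimal illustration of these timing constraints, the following Python sketch (a hypothetical helper, not part of the patent) checks whether a candidate schedule for a single unmanned aerial vehicle satisfies both the task time windows and the condition st_j ≥ st_i + dt_i + dt_ij:

from dataclasses import dataclass

@dataclass
class Task:
    t_start: float   # earliest allowed start time
    t_end: float     # latest allowed start time
    duration: float  # dt_i

def schedule_is_feasible(tasks, order, start_times, travel_time):
    # tasks: dict task_id -> Task; order: task ids in execution order for one UAV;
    # start_times: dict task_id -> chosen start time st_i;
    # travel_time(i, j): transit time dt_ij between task nodes i and j.
    for idx, j in enumerate(order):
        st_j = start_times[j]
        if not (tasks[j].t_start <= st_j <= tasks[j].t_end):
            return False  # task must start within its time window
        if idx > 0:
            i = order[idx - 1]
            if st_j < start_times[i] + tasks[i].duration + travel_time(i, j):
                return False  # violates st_j >= st_i + dt_i + dt_ij
    return True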
Step S2, every preset number of time steps, collecting the observation information of each executor, performing global task allocation with the configured coordinator algorithm, and transmitting the allocation results to the executors, so as to maximize the cumulative reward function of the upper-layer coordinator.
Specifically, an algorithm framework for multi-agent hierarchical reinforcement learning is designed, as shown in fig. 4. The framework is divided into a coordinator and executors, which perform centralized task allocation and distributed path planning, respectively. A central decision point is set as the coordinator; every K time steps it collects the observation information of each executor (i.e., each unmanned aerial vehicle) as input data, produces a global task allocation with a reinforcement learning algorithm, and transmits the allocation results to the executors.
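To illustrate this two-timescale interaction, the following sketch shows how the coordinator could be invoked every K steps while the executors act at every step; the environment and network interfaces (env.reset, coordinator.allocate, executor.act) are assumptions for illustration, not the patent's implementation:

def run_episode(env, coordinator, executors, K, max_steps=1000):
    observations = env.reset()                   # one observation per executor
    goals = coordinator.allocate(observations)   # initial global task allocation
    for t in range(1, max_steps + 1):
        # lower layer: each executor acts toward its assigned task node
        actions = [ex.act(obs, goal) for ex, obs, goal in zip(executors, observations, goals)]
        observations, rewards, done = env.step(actions)
        if done:
            break
        if t % K == 0:
            # upper layer: gather joint observations and re-plan globally
            goals = coordinator.allocate(observations)
    return rewards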
S21. The central decision point serves as the coordinator and is trained with an actor-critic (AC) algorithm; fig. 4 shows the overall framework of the algorithm. The coordinator collects the observation information obtained from each executor, performs global planning, and assigns a dedicated task target to each executor. The specific steps are as follows:
S22. First, the coordinator collects the joint observation information transmitted by the executors within K time steps, where o_i,t is the observation of agent i at time t, o_i,t = ((x_i,t, y_i,t, z_i,t), (t_start,m, t_end,m, x_m, y_m, z_m), ...), in which (x_i,t, y_i,t, z_i,t) is the current position of the unmanned aerial vehicle itself and (t_start,m, t_end,m, x_m, y_m, z_m) is the collected start time, end time, and position information of task node m, m = 1, 2, ..., M.
S23. The coordinator processes the observations with a self-attention mechanism. The self-attention mechanism can handle sequences of different lengths, so it can cope with dynamic task planning problems with different numbers of unmanned aerial vehicles, numbers of targets, and allocation situations. The scaled dot-product attention adopted by the invention is computed as:

Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
where d_k is the dimension of the keys, and the matrices Q, K, and V are the query, key, and value obtained by transforming the input matrix X with the parameter matrices W_q, W_k, and W_v, respectively:

Q = tanh(W_q X)
K = tanh(W_k X)
V = tanh(W_v X)
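A minimal NumPy sketch of this attention computation follows; the matrix shapes and function name are assumptions for illustration and are not taken from the patent:

import numpy as np

def scaled_dot_product_attention(X, W_q, W_k, W_v):
    # X: (n, d_model) stacked executor observations; W_q, W_k, W_v: (d_k, d_model) projections.
    Q = np.tanh(W_q @ X.T).T          # (n, d_k) queries, Q = tanh(Wq X)
    K = np.tanh(W_k @ X.T).T          # (n, d_k) keys,    K = tanh(Wk X)
    V = np.tanh(W_v @ X.T).T          # (n, d_k) values,  V = tanh(Wv X)
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (n, n) scaled dot products
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                # (n, d_k) attended state representation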
S24. The collected observation information is fed into the attention module to obtain a state representation s that is independent of the input order, and s serves as the input of the coordinator decision layer. The coordinator decision layer has an AC structure comprising a policy network (Actor) and an evaluation network (Critic); the specific training steps are as follows.
S25. First, the network structures and parameters of the coordinator policy network (Actor) and the evaluation network (Critic) are initialized. The policy network Actor is a fully connected neural network whose input is the state representation s and whose output is the joint task allocation value, where g_i,t is a vector over the task dimension in which each element represents the probability that agent i performs task j (j = 0, 1, ..., m) in this time period.
S26. Agent i selects the task with the highest probability and executes it. The policy network Actor then obtains an immediate reward according to the upper-layer reward function and enters the next state s'. The reward function is designed as follows:
S27. The coordinator reward function is determined by the coverage rate of the task nodes at time t, where I_j,t indicates that task node j is covered at time t. When no task is covered at time t, a punitive reward of -0.1 is given. The coordinator's reward function is therefore designed as the fraction of task nodes covered at time t, i.e. (Σ_j I_j,t) / N, and -0.1 when no task node is covered.
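A small sketch of this coverage-based reward follows; normalizing by the total number of task nodes is an assumption consistent with the coverage-rate wording above:

def coordinator_reward(covered_flags):
    # covered_flags: list of booleans I_{j,t}, one per task node, at time t
    n_covered = sum(covered_flags)
    if n_covered == 0:
        return -0.1                        # punitive reward when nothing is covered
    return n_covered / len(covered_flags)  # fraction of task nodes covered at time t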
S28. The state representation s, the next state s', and the reward obtained by the Actor network are fed into the evaluation network Critic, which outputs the value function V(s), and the TD-error is computed as the loss function of the evaluation network Critic. The TD-error is computed as:

TD-error = r + gamma * V(s') - V(s)
where V(s) may be approximated according to the reward function. The training objective of the evaluation network is to reduce this loss as much as possible; the trained value function V(s) output by the Critic then guides the parameter update of the policy network Actor, so that the Actor is updated in the direction that obtains larger returns.
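A compact PyTorch-style sketch of this Critic update is shown below; the network and optimizer objects are assumed and the code is illustrative, not the patent's implementation:

import torch

def critic_update(critic, optimizer, s, s_next, reward, gamma=0.99):
    # One TD(0) update: minimize (r + gamma * V(s') - V(s))^2
    v_s = critic(s)                              # V(s)
    with torch.no_grad():
        v_next = critic(s_next)                  # V(s'), held fixed as the target
    td_error = reward + gamma * v_next - v_s     # TD-error
    loss = td_error.pow(2).mean()                # Critic loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return td_error.detach()                     # can serve as the Actor's advantage signal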
Step S3, using the configured executor algorithm, enabling each executor to reach its task node in the fewest time steps by taking actions to complete the task corresponding to the allocation result, so as to maximize the cumulative reward function of the lower-layer executor.
Specifically, as shown in fig. 4, each executor i (i = 1, 2, ..., n) has its own policy network. Each unmanned aerial vehicle acts as an executor whose goal is to complete the assigned task by taking a series of actions so as to reach the task node in as few time steps as possible. The executor is trained with deep reinforcement learning, specifically Proximal Policy Optimization (PPO); at each time step it selects one action to perform (a movement in one of the four directions up, down, left, or right), so that the executor's cumulative reward function is maximized.
S31. All executors adopt the PPO algorithm, constructing two policy networks and one evaluation network. Each executor receives its assigned task g_i,t from the coordinator together with its own observation o_i,t = ((x_i,t, y_i,t, z_i,t), (t_start,m, t_end,m, x_m, y_m, z_m), ...). The lower-layer executor uses a target condition filter to clean its own observation according to g_i,t, removing observation information irrelevant to the assigned target and reducing input redundancy; the cleaned input serves as the state s. For example, if g_i = [1, 0, 1] and o_i = [o_i,1, o_i,2, o_i,3], then o_i,2 is filtered out, and o_i,1 and o_i,3 jointly form the state s that serves as the input of the lower-layer executor. The training method of the lower-layer executor is as follows:
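A minimal sketch of such a target condition filter (a hypothetical helper, not the patent's code) could look as follows:

def target_condition_filter(goal_mask, observations):
    # goal_mask: binary list g_i with one flag per task node;
    # observations: list o_i with one observation per task node.
    # Keeps only observations of assigned targets, e.g. g_i=[1,0,1], o_i=[o1,o2,o3] -> [o1,o3]
    return [obs for keep, obs in zip(goal_mask, observations) if keep]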
S32. First, two fully connected neural networks with parameters θ (an Actor policy network and an Actor_old policy network) and an evaluation network Critic are initialized. The Actor policy network is used to select the agent's actions and requires training updates and parameter backpropagation; the Actor's network parameters are copied to Actor_old, so Actor_old needs no gradient backpropagation. The learning rates and optimizers of the Actor and Critic networks are set, and the clipping parameter ε and other hyperparameters are initialized.
S33. The state of an executor consists of its position, velocity, and acceleration, and the motion state at time t+1 is defined in terms of the state at time t, where p_t is the position of the drone at the current time, v_t is its velocity, a_t is its acceleration, I is an acceleration coefficient vector, and τ is a velocity coefficient vector.
S34. In the present invention, the flying speed and acceleration of the unmanned aerial vehicle are assumed to be constant. The action space of the lower-layer executor i is therefore defined by the yaw angle and the pitch angle of the unmanned aerial vehicle, so that with this two-dimensional vector the executor can move up, down, left, and right.
S35. Each lower-layer executor must avoid collisions and reach its task node as soon as possible, so the executor's reward function is designed in terms of the normalized and regularized distance between the coordinates of the unmanned aerial vehicle's position and the coordinates of the target location.
In order to keep a certain distance from obstacles, a threat function is defined that represents the threat level an obstacle poses to the route planning.
The executor's overall reward function is then obtained by combining the distance-based reward with this threat function.
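A rough sketch of such a combined reward is given below; the specific distance normalization and threat penalty are assumptions for illustration, since the patent's exact formulas are not reproduced here:

import math

def executor_reward(uav_pos, target_pos, obstacle_dists, safe_radius=5.0, norm_dist=141.4):
    # uav_pos, target_pos: (x, y, z) coordinates; obstacle_dists: distances to obstacles (km).
    dist = math.dist(uav_pos, target_pos)
    r_goal = -dist / norm_dist               # normalized distance-to-target penalty
    # threat term: obstacles closer than the safe radius add a penalty
    r_threat = -sum(max(0.0, (safe_radius - d) / safe_radius) for d in obstacle_dists)
    return r_goal + r_threat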
S36. The state s is input to the Actor policy network, which selects an action and outputs the action together with its probability distribution. Based on the chosen action, the state s, the next state s', and the immediate reward currently obtained, the Critic evaluation network computes the advantage function A_t, and from the advantage function the loss function of the Critic evaluation network is obtained.
The policy network then updates its parameters according to the PPO objective.
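For reference, the standard PPO clipped surrogate objective that this update step follows can be sketched as below (standard PPO shown for illustration; the patent's exact objective is not reproduced here):

import torch

def ppo_actor_loss(log_prob_new, log_prob_old, advantage, clip_eps=0.2):
    # Clipped surrogate: maximize E[min(r_t * A_t, clip(r_t, 1-eps, 1+eps) * A_t)],
    # where r_t = pi_theta(a|s) / pi_theta_old(a|s)
    ratio = torch.exp(log_prob_new - log_prob_old)            # probability ratio r_t
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    return -torch.min(unclipped, clipped).mean()              # negated for gradient descent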
The Critic evaluation network scores the actions given by the Actor policy network, so that the Actor policy network is iteratively optimized. The lower-layer executor then uses the trained Actor policy network to select actions for path planning, obstacle avoidance, and other motion decision tasks, so as to reach its task node as soon as possible.
Step S4, iterating the coordinator cumulative reward function and the executor cumulative reward function so that the upper-layer function and the lower-layer function each reach their cumulative maximum.
Specifically, a typical three-dimensional unmanned aerial vehicle task allocation scenario is constructed and the algorithm is iteratively optimized, with the algorithm parameters continuously adjusted so that the upper-layer and lower-layer functions each reach their cumulative maximum and the convergence of the algorithm is accelerated. The task completion degree of the algorithm of this patent is then compared with that of traditional unmanned aerial vehicle dynamic planning methods.
S41. A typical three-dimensional scenario for unmanned aerial vehicle task allocation is constructed, with a 100 km × 100 km combat scene set up as shown in fig. 5. During initialization, obstacles of different sizes are randomly generated in the environment, and N task nodes are initialized, their number equal to the number of unmanned aerial vehicles. The task nodes are randomly generated on the XY plane, and new tasks are issued at different times. The executors do not need to communicate with each other; they only communicate with the coordinator once every K time steps. The maximum number of time steps per training episode is 1000, and the maximum number of task nodes is M = 50.
S42. The algorithm is iteratively optimized, with the total number of solving iterations set to L and the current iteration number denoted l. In the experiment, the total number of iterations is set to L = 50000; the coordinator's current reward function is computed throughout, and once the coordinator's reward function converges, training is complete and the final return and task coverage rate are calculated. The coordinator parameters and executor parameters are set accordingly.
Compared with a centralized decision algorithm, the method adopts a hierarchical reinforcement learning algorithm to decompose the problem into two layers with different time scales, which greatly reduces the solving difficulty; information is collected and communicated once every K steps, which guarantees the speed of task planning and improves the task allocation completion degree.
The invention uses the convergence rate of the task allocation reward and the reward function values r_H and r_L as indices for model evaluation.
The invention selects the traditional dynamic task planning method (linear optimization) and other baseline multi-agent algorithms (MADDPG, QMIX, COMA) for comparison experiments on convergence speed and average reward.
From the above results, the multi-UAV dynamic task allocation algorithm based on hierarchical reinforcement learning provided by the invention is superior to the traditional dynamic task planning method (linear optimization) and the other baseline multi-agent algorithms (MADDPG, QMIX, COMA) in convergence rate, and superior to the baseline multi-agent algorithms (MADDPG, QMIX, COMA) in average reward value. These results demonstrate the feasibility and superiority of the proposed multi-UAV dynamic task allocation algorithm based on hierarchical reinforcement learning.
The invention adopts a self-attention module to process variable-length inputs and obtain a state representation independent of the input order, so it can flexibly handle varying numbers of unmanned aerial vehicles and task nodes, effectively improving generalization. In addition, a target condition filter is added to the lower-layer algorithm so that an agent does not attend to unallocated tasks, reducing the difficulty and complexity of model training. The feasibility of the invention shows that the algorithm can be generalized and extended to other swarm intelligence optimization problems to improve their performance.
In summary, compared with the prior art, the present embodiment has at least the following advantages:
1) The invention improves on traditional unmanned aerial vehicle dynamic planning methods. Compared with a centralized decision algorithm, it adopts a hierarchical reinforcement learning algorithm to decompose the problem into two layers with different time scales, which greatly reduces the solving difficulty; information is collected and communicated once every K steps, which guarantees the speed of task planning and improves the task allocation completion degree.
2) The invention adopts a self-attention module to process variable-length inputs and obtain a state representation independent of the input order, so it can flexibly handle a varying number of task nodes and improves generalization. A target condition filter is also adopted to reduce the redundant information seen by the lower-layer executors.
3) Compared with other traditional methods such as linear programming and general multi-agent reinforcement learning, the algorithm framework provided by the invention improves the convergence rate and the average reward value, completes task allocation better, and obtains larger returns. The feasibility of the invention shows that the algorithm can be generalized and extended to other swarm intelligence optimization problems to improve their performance.
In a second embodiment of the present invention, corresponding to the first embodiment, an unmanned aerial vehicle dynamic task allocation device based on hierarchical reinforcement learning is introduced, as shown in fig. 6, comprising the following components:
a modeling unit configured to model a task allocation scenario, wherein a control end serves as the coordinator of an upper layer and the unmanned aerial vehicles serve as the executors of a lower layer;
an upper-layer unit configured to collect, every preset number of time steps, the observation information of each executor, perform global task allocation with the configured coordinator algorithm, and transmit the allocation results to the executors, so as to maximize the cumulative reward function of the upper-layer coordinator;
a lower-layer unit configured to use the configured executor algorithm to enable each executor to reach its task node in the fewest time steps by taking actions to complete the task corresponding to the allocation result, so as to maximize the cumulative reward function of the lower-layer executor;
and an iteration unit configured to iterate the coordinator cumulative reward function and the executor cumulative reward function so that the upper-layer function and the lower-layer function each reach their cumulative maximum.
In one embodiment, the modeling unit is further configured to:
initialize M unmanned aerial vehicles and N task nodes, wherein each task has a start time and an end time, each task is assigned to an unmanned aerial vehicle as a target, and the unmanned aerial vehicles follow preset time constraints.
In one embodiment, the upper-layer unit is further configured to:
train with an actor-critic algorithm to determine the vector corresponding to the task allocation result.
In one embodiment, the lower-layer unit is further configured to:
train with deep reinforcement learning, wherein at each time step the executor selects one action to perform, the actions comprising: moving in the up, down, left, and right directions.
A third embodiment of the present invention, as shown in fig. 7, can be understood as a physical device comprising a processor and a memory storing processor-executable instructions which, when executed by the processor, perform the following operations:
Step S1, modeling a task allocation scenario, wherein a control end serves as the coordinator of an upper layer and the unmanned aerial vehicles serve as the executors of a lower layer;
Step S2, every preset number of time steps, collecting the observation information of each executor, performing global task allocation with the configured coordinator algorithm, and transmitting the allocation results to the executors, so as to maximize the cumulative reward function of the upper-layer coordinator;
Step S3, using the configured executor algorithm, enabling each executor to reach its task node in the fewest time steps by taking actions to complete the task corresponding to the allocation result, so as to maximize the cumulative reward function of the lower-layer executor;
Step S4, iterating the coordinator cumulative reward function and the executor cumulative reward function so that the upper-layer function and the lower-layer function each reach their cumulative maximum.
In a fourth embodiment of the present invention, the procedure of the unmanned aerial vehicle dynamic task allocation method based on hierarchical reinforcement learning is the same as in the first, second, or third embodiment, except that, in engineering implementation, this embodiment may be realized by software plus a necessary general-purpose hardware platform, and of course also by hardware, although in many cases the former is the better implementation. Based on this understanding, the method of the present invention may be embodied in the form of a computer software product stored on a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) comprising instructions for causing a device to perform the method of the embodiments of the present invention.
While the invention has been described in connection with specific embodiments thereof, it is to be understood that these embodiments fall within the spirit and scope of the invention, and the invention is not to be limited thereto.

Claims (10)

1. An unmanned aerial vehicle dynamic task allocation method based on hierarchical reinforcement learning, characterized by comprising the following steps:
S1, modeling a task allocation scenario, wherein a control end serves as the coordinator of an upper layer and the unmanned aerial vehicles serve as the executors of a lower layer;
S2, every preset number of time steps, collecting the observation information of each executor, performing global task allocation with the configured coordinator algorithm, and transmitting the allocation results to the executors, so as to maximize the cumulative reward function of the upper-layer coordinator;
S3, using the configured executor algorithm, enabling each executor to reach its task node in the fewest time steps by taking actions to complete the task corresponding to the allocation result, so as to maximize the cumulative reward function of the lower-layer executor;
S4, iterating the coordinator cumulative reward function and the executor cumulative reward function so that the upper-layer function and the lower-layer function each reach their cumulative maximum.
2. The unmanned aerial vehicle dynamic task allocation method based on hierarchical reinforcement learning of claim 1, wherein modeling the task allocation scenario comprises:
initializing M unmanned aerial vehicles and N task nodes, wherein each task has a start time and an end time, each task is assigned to an unmanned aerial vehicle as a target, and the unmanned aerial vehicles follow preset time constraints.
3. The unmanned aerial vehicle dynamic task allocation method based on hierarchical reinforcement learning of claim 2, wherein the step S2 comprises:
training with an actor-critic algorithm to determine the vector corresponding to the task allocation result.
4. The unmanned aerial vehicle dynamic task allocation method based on hierarchical reinforcement learning of claim 3, wherein the step S3 comprises:
training with deep reinforcement learning, wherein at each time step the executor selects one action to perform, the actions comprising: moving in the up, down, left, and right directions.
5. An unmanned aerial vehicle dynamic task allocation device based on hierarchical reinforcement learning, characterized by comprising:
a modeling unit configured to model a task allocation scenario, wherein a control end serves as the coordinator of an upper layer and the unmanned aerial vehicles serve as the executors of a lower layer;
an upper-layer unit configured to collect, every preset number of time steps, the observation information of each executor, perform global task allocation with the configured coordinator algorithm, and transmit the allocation results to the executors, so as to maximize the cumulative reward function of the upper-layer coordinator;
a lower-layer unit configured to use the configured executor algorithm to enable each executor to reach its task node in the fewest time steps by taking actions to complete the task corresponding to the allocation result, so as to maximize the cumulative reward function of the lower-layer executor;
and an iteration unit configured to iterate the coordinator cumulative reward function and the executor cumulative reward function so that the upper-layer function and the lower-layer function each reach their cumulative maximum.
6. The unmanned aerial vehicle dynamic task allocation device based on hierarchical reinforcement learning of claim 5, wherein the modeling unit is further configured to:
initialize M unmanned aerial vehicles and N task nodes, wherein each task has a start time and an end time, each task is assigned to an unmanned aerial vehicle as a target, and the unmanned aerial vehicles follow preset time constraints.
7. The unmanned aerial vehicle dynamic task allocation device based on hierarchical reinforcement learning of claim 6, wherein the upper-layer unit is further configured to:
train with an actor-critic algorithm to determine the vector corresponding to the task allocation result.
8. The unmanned aerial vehicle dynamic task allocation device based on hierarchical reinforcement learning of claim 7, wherein the lower-layer unit is further configured to:
train with deep reinforcement learning, wherein at each time step the executor selects one action to perform, the actions comprising: moving in the up, down, left, and right directions.
9. An electronic device, comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the computer program, when executed by the processor, performs the steps of the unmanned aerial vehicle dynamic task allocation method based on hierarchical reinforcement learning of any one of claims 1 to 4.
10. A computer storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the unmanned aerial vehicle dynamic task allocation method based on hierarchical reinforcement learning of any one of claims 1 to 4.
CN202410037582.4A 2024-01-10 2024-01-10 Unmanned aerial vehicle dynamic task allocation method and device based on hierarchical reinforcement learning Pending CN117933622A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410037582.4A CN117933622A (en) 2024-01-10 2024-01-10 Unmanned aerial vehicle dynamic task allocation method and device based on hierarchical reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410037582.4A CN117933622A (en) 2024-01-10 2024-01-10 Unmanned aerial vehicle dynamic task allocation method and device based on hierarchical reinforcement learning

Publications (1)

Publication Number Publication Date
CN117933622A true CN117933622A (en) 2024-04-26

Family

ID=90764035

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410037582.4A Pending CN117933622A (en) 2024-01-10 2024-01-10 Unmanned aerial vehicle dynamic task allocation method and device based on hierarchical reinforcement learning

Country Status (1)

Country Link
CN (1) CN117933622A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination