CN111026147A - Zero overshoot unmanned aerial vehicle position control method and device based on deep reinforcement learning

Info

Publication number
CN111026147A
Authority
CN
China
Prior art keywords: aerial vehicle, unmanned aerial, control, speed, model
Prior art date
Legal status
Granted
Application number
CN201911363490.0A
Other languages
Chinese (zh)
Other versions
CN111026147B (en)
Inventor
单光存
张一楠
Current Assignee
Everlasting Technology Hangzhou Co ltd
Beihang University
Original Assignee
Everlasting Technology Hangzhou Co ltd
Beihang University
Priority date
Filing date
Publication date
Application filed by Everlasting Technology Hangzhou Co ltd and Beihang University
Priority to CN201911363490.0A
Publication of CN111026147A
Application granted
Publication of CN111026147B
Legal status: Active

Classifications

    • G: PHYSICS
    • G05: CONTROLLING; REGULATING
    • G05D: SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00: Control of position, course or altitude of land, water, air, or space vehicles, e.g. automatic pilot
    • G05D1/08: Control of attitude, i.e. control of roll, pitch, or yaw
    • G05D1/0808: Control of attitude, i.e. control of roll, pitch, or yaw, specially adapted for aircraft
    • G05D1/10: Simultaneous control of position or course in three dimensions
    • G05D1/101: Simultaneous control of position or course in three dimensions, specially adapted for aircraft

Abstract

The present disclosure provides a zero-overshoot unmanned aerial vehicle (UAV) position control method based on deep reinforcement learning, comprising: S1, constructing a reinforcement learning training framework for UAV speed control based on the proximal policy optimization algorithm, and training a UAV control model in combination with a feature extraction network to obtain a UAV speed control model; S2, controlling the UAV by adding a PID control loop outside the UAV control model, performing an optimal search over the PID parameters, and using the PID control algorithm to convert the UAV speed control model into a UAV position control model, thereby eliminating overshoot in position control. The disclosure can achieve effective UAV speed control within an allowable static error range and, building on that effective speed control, can further achieve zero-overshoot UAV position control.

Description

Zero overshoot unmanned aerial vehicle position control method and device based on deep reinforcement learning
Technical Field
The disclosure relates to the field of unmanned aerial vehicles, in particular to a zero overshoot unmanned aerial vehicle position control method and device based on deep reinforcement learning.
Background
As a typical instance of an unmanned system, the unmanned aerial vehicle (UAV) has, after years of development, found many practical applications in a variety of fields and can help solve many real-world problems as society becomes increasingly intelligent. Quadrotor UAVs in particular have attracted considerable attention in recent years owing to their low cost, high maneuverability and lightweight structure, and great progress has been made in theory, application and industrial production.
In military applications, the quadrotor UAV has very high priority in special scenarios such as personnel search and rescue and reconnaissance, owing to its high maneuverability and small size. On the civilian side, mass-produced quadrotor UAVs also perform remarkably well in logistics and transportation, fire early warning, crop protection and aerial photography.
Disclosure of Invention
Technical problem to be solved
The present disclosure provides a zero overshoot drone position control method and apparatus based on deep reinforcement learning to at least partially solve the above-mentioned technical problems.
(II) technical scheme
According to one aspect of the disclosure, a zero overshoot unmanned aerial vehicle position control method based on deep reinforcement learning is provided, which includes:
S1, constructing a reinforcement learning training framework for unmanned aerial vehicle (UAV) speed control based on the proximal policy optimization (PPO) algorithm, and training the UAV control model in combination with a feature extraction network to obtain a UAV speed control model;
S2, controlling the UAV by adding a PID control loop outside the UAV control model, performing an optimal search over the PID parameters, and using the PID control algorithm to convert the UAV speed control model into a UAV position control model, thereby eliminating overshoot in position control.
In some embodiments, in step S1, the control model for UAV speed control is a Markov model whose observation state is:

$$s_t = \left[\dot{x}_t,\ \dot{y}_t,\ \dot{z}_t,\ \phi_t,\ \theta_t,\ \psi_t,\ \dot{\phi}_t,\ \dot{\theta}_t,\ \dot{\psi}_t\right]$$

where ẋ_t, ẏ_t, ż_t are the velocities in the x, y and z directions respectively, φ_t, θ_t, ψ_t are the UAV attitude expressed in Euler angles, and φ̇_t, θ̇_t, ψ̇_t are the corresponding Euler-angle angular velocities.
In some embodiments, step S1 includes:
alternately training the UAV control algorithm unit and the state evaluation unit of the reinforcement learning training framework for UAV speed control, and fitting the optimal mapping from the current UAV state to the UAV control signal.
In some embodiments, the drone control algorithm unit maps the current state of the drone to a control signal, the control signal being a drone speed control signal;
the state evaluation unit evaluates the current state of the unmanned aerial vehicle, and the evaluation standard is the difference between the current state and the target state.
In some embodiments, the state evaluation unit determines the difference between the current speed and the target control speed; the larger the difference, or the longer the time required to achieve the speed-state transition, the smaller the output value of the state evaluation function, and vice versa.
In some embodiments, determining the difference between the current speed and the target control speed comprises:
computing the integral, over a time window, of the difference between the UAV speed and the UAV target speed; the faster the UAV reaches the target speed, the sooner effective control is achieved.
In some embodiments, the state evaluation unit employs a reward mechanism under whose incentive the final control model converges to speed-optimal control; the reward function adopted by the reward mechanism is of the form:

$$r_t = r_{scalar} - k\, f\!\left(\Delta v_x,\ \Delta v_y,\ \Delta v_z\right) + r_{bonus}$$

where Δv_x, Δv_y, Δv_z are the differences between the current UAV speed and the target speed on the x, y and z coordinate axes respectively; f(Δv_x, Δv_y, Δv_z) is a penalty function constructed from the UAV speed differences; k is the proportionality coefficient between the speed gap and the penalty amount; r_scalar is a constant that keeps the reward positive; and the bonus term r_bonus rewards the control agent when the speed control is within the allowable error.
In some embodiments, the UAV control algorithm unit and the state evaluation unit both adopt a feature extraction network structure. The feature extraction network of each unit comprises an input layer, a first fully-connected hidden layer, a second fully-connected hidden layer and an output layer; the dimensions of the input layer, the first hidden layer and the second hidden layer are the same for both units, and the output dimension of the UAV control algorithm unit is determined by the control strategy parameters of the UAV power plant.
In some embodiments, step S2 includes:

providing a desired target position for the UAV, and resolving the UAV target speed by a PID method, i.e.

$$v_t^{ref} = K_p\, e_t + K_d\, \dot{e}_t$$

where K_p and K_d are the proportional and derivative term coefficients, e_t is the difference between the UAV position and the target position, and v_t^{ref} is the UAV target speed;

obtaining, through reinforcement learning, the mapping from the target speed v_t^{ref} to the speeds of the UAV power plant, so that the control model for UAV speed control is converted into a control model for UAV position control; the UAV position control model is combined with the PID algorithm model, the integral term of the PID algorithm model is deleted and the proportional and derivative terms are retained, thereby achieving zero-overshoot UAV position control.
According to another aspect of the present disclosure, there is provided a zero overshoot drone position control device based on deep reinforcement learning, including:
a readable storage medium to store executable instructions;
one or more processors executing the control method as described above according to the executable instructions.
(III) advantageous effects
According to the technical scheme, the zero overshoot unmanned aerial vehicle position control method and device based on deep reinforcement learning at least have one of the following beneficial effects:
(1) by constructing a reinforcement learning training framework for UAV speed control, once a specific allowable static error range is set, effective speed control of the UAV within that range can be achieved;
(2) building on the effective UAV speed control and combining it with a PID algorithm model from which the integral term has been removed, zero-overshoot UAV position control can further be achieved.
Drawings
Fig. 1 is a flowchart of a method for controlling a position of a zero overshoot unmanned aerial vehicle based on deep reinforcement learning according to an embodiment of the present disclosure.
FIG. 2 is a block diagram of a reinforcement learning algorithm of the present disclosure.
Fig. 3 is a schematic structural diagram of a feature extraction network of the unmanned aerial vehicle control algorithm unit and the state evaluation unit in the embodiment of the present disclosure.
Fig. 4 is a flowchart of the unmanned aerial vehicle position control according to the embodiment of the present disclosure.
Fig. 5 is a schematic diagram of a three-axis response curve controlled by the drone according to an embodiment of the present disclosure.
Fig. 6 is a flowchart of a method for arranging and placing articles by using an unmanned aerial vehicle according to an embodiment of the present disclosure.
Detailed Description
The UAV is a highly nonlinear system, so solving for its optimal control is a very complex problem, and in most cases only an approximate solution can be obtained. Over extremely long time horizons, when a discrete system is solved for the long-horizon optimum, the deviation caused by approximate solutions becomes even more serious, so the control problem of such a complex system is difficult to solve effectively. In recent years, deep learning methods have performed excellently on complex nonlinear mapping problems in fields such as images and speech, providing a very good tool for complex nonlinear system problems: the excellent fitting capability of neural networks is used to represent the complex mapping of a nonlinear system, converting the search for the optimal solution of the nonlinear system into a data-sample-driven fitting problem.
On the application of reinforcement learning to UAV control, many researchers have already done related work, for example on stable hand-launch takeoff, speed control and obstacle avoidance, so applying reinforcement learning to UAV control problems is efficient and feasible. However, prior studies and a large number of training experiments show that when a UAV control agent is trained with a reinforcement learning algorithm, the overshoot of the UAV during flight control cannot be constrained. As a result, a control agent obtained through reinforcement learning training often exhibits a large overshoot, which cannot satisfy overshoot-sensitive position control tasks such as UAV landing and obstacle avoidance and can easily put the UAV in a dangerous situation.
To solve these problems, the present disclosure provides a zero-overshoot UAV position control method based on deep reinforcement learning combined with a PID control algorithm.
For the purpose of promoting a better understanding of the objects, aspects and advantages of the present disclosure, reference is made to the following detailed description taken in conjunction with the accompanying drawings.
Certain embodiments of the present disclosure will now be described more fully hereinafter with reference to the accompanying drawings, in which some, but not all embodiments of the disclosure are shown. Indeed, various embodiments of the disclosure may be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will satisfy applicable legal requirements.
In one exemplary embodiment of the disclosure, a zero overshoot drone position control method based on deep reinforcement learning is provided.
First, UAV speed control is formulated as a Markov model, whose observation state can be written as:

$$s_t = \left[\dot{x}_t,\ \dot{y}_t,\ \dot{z}_t,\ \phi_t,\ \theta_t,\ \psi_t,\ \dot{\phi}_t,\ \dot{\theta}_t,\ \dot{\psi}_t\right]$$

where ẋ_t, ẏ_t, ż_t are the velocities in the x, y and z directions respectively, φ_t, θ_t, ψ_t are the UAV attitude expressed in Euler angles, and φ̇_t, θ̇_t, ψ̇_t are the corresponding Euler-angle angular velocities. With this state, complete information for UAV speed control is available, and the UAV control model is then trained through reinforcement learning by defining an evaluation mechanism. The evaluation mechanism adopted in this embodiment is mainly the integral, over a time window, of the gap between the UAV speed and the UAV target speed, so the faster the UAV reaches the target speed, the sooner effective control can be achieved.
The UAV control model of this embodiment outputs the rotational speeds of the four UAV propellers:

$$[\Omega_1,\ \Omega_2,\ \Omega_3,\ \Omega_4]$$

By comparing the difference between the current state and the target state, the controlled propeller speeds are output through the action output network, and the network is optimized with the reinforcement learning algorithm so that the desired goal, effective speed control, is achieved. Once the speed is effectively controlled, the UAV position can then be effectively controlled by a PID method.
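As an illustration only (not part of the original filing), the 9-dimensional observation and the 4-dimensional propeller-speed action described above can be packed as follows; the helper names, the data layout and the rotor-speed limit are hypothetical choices made for this sketch.

```python
import numpy as np

def make_observation(vel, euler, euler_rates):
    """Packs the observation s_t: velocities (vx, vy, vz), Euler angles
    (phi, theta, psi) and Euler-angle rates (dphi, dtheta, dpsi)."""
    return np.concatenate([np.asarray(vel, dtype=np.float32),
                           np.asarray(euler, dtype=np.float32),
                           np.asarray(euler_rates, dtype=np.float32)])

def clip_action(omegas, omega_max):
    """Clips the commanded propeller speeds [Omega_1..Omega_4] to the rotor limits."""
    return np.clip(np.asarray(omegas, dtype=np.float32), 0.0, omega_max)

# Example: a 9-dim state and a clipped 4-dim action (numeric values are arbitrary).
obs = make_observation((0.1, -0.2, 0.0), (0.01, 0.02, 1.57), (0.0, 0.0, 0.0))
act = clip_action([450.0, 455.0, 448.0, 452.0], omega_max=800.0)
```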
Fig. 1 is a flowchart of a method for controlling a position of a zero overshoot unmanned aerial vehicle based on deep reinforcement learning according to an embodiment of the present disclosure. As shown in fig. 1, the zero overshoot unmanned aerial vehicle position control method based on deep reinforcement learning of the present disclosure includes:
S1, constructing a reinforcement learning training framework for UAV speed control based on the proximal policy optimization (PPO) algorithm, and training the UAV control model in combination with a feature extraction network to obtain a UAV speed control model;
S2, controlling the UAV by adding a PID control loop outside the UAV control model, performing an optimal search over the PID parameters, and using the PID control algorithm to convert the UAV speed control model into a UAV position control model, thereby eliminating overshoot in position control.
Each step of the zero overshoot unmanned aerial vehicle position control method based on deep reinforcement learning according to the embodiment of the present disclosure is specifically described below.
FIG. 2 is a block diagram of the reinforcement learning algorithm of the present disclosure. The reinforcement learning algorithm is based on a proximal policy optimization (PPO) framework, and as shown in Fig. 2, the algorithm framework comprises a UAV control algorithm unit and a state evaluation unit.
For a quadrotor UAV, the UAV control algorithm unit maps the current UAV state to a control signal, and this control signal consists of the four propeller rotational speeds of the quadrotor model.
The state evaluation unit is used for evaluating the current state of the unmanned aerial vehicle, and the evaluation standard is the difference between the current speed and the target control speed. If the difference is large or it takes a long time to achieve the speed state transition, the output value of the state evaluation function will be small, and conversely, it will be large.
Specifically, the reinforcement learning algorithm of the present disclosure includes two steps:
the first step S1 is a training phase, in which the unmanned aerial vehicle control algorithm unit and the state evaluation unit are trained, and the optimal mapping from the current state of the unmanned aerial vehicle to the four control signals of the unmanned aerial vehicle is finally fitted through the alternate training of the unmanned aerial vehicle control algorithm unit and the state evaluation unit, that is, the unmanned aerial vehicle control algorithm unit obtains the control model for controlling the speed of the unmanned aerial vehicle.
In the reinforcement learning algorithm of the present disclosure, the features that the control algorithm unit needs to extract are similar to those needed by the state evaluation unit, so feature extraction networks with a similar structure are designed for both units. Fig. 3 is a schematic structural diagram of the feature extraction networks of the UAV control algorithm unit and the state evaluation unit in the embodiment of the present disclosure. As shown in Fig. 3, each feature extraction network comprises an input layer, a first fully-connected hidden layer, a second fully-connected hidden layer and an output layer. The two networks differ only in the output-layer dimension: the output dimension of the UAV control algorithm unit is 8, while that of the state evaluation unit is 1. The specific configuration of the two units is shown in Table 1.
TABLE 1 (layer configuration of the two feature extraction networks; presented as an image in the original publication)
Specifically, the 8-dimensional output of the UAV control algorithm unit consists of the parameters of the four propeller control strategies: the strategy of each propeller is represented by a Beta distribution, and a Beta distribution is described by two parameters a and b, so the total output dimension is 8.
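As a minimal sketch only, the two networks described above might be implemented as follows; the hidden-layer widths are placeholders (the actual sizes are given in Table 1 of the original filing, which is an image), and the PyTorch implementation and activation choices are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """Input layer plus two fully-connected hidden layers (widths are placeholders)."""
    def __init__(self, obs_dim=9, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )

    def forward(self, x):
        return self.net(x)

class Actor(nn.Module):
    """Control algorithm unit: 8 outputs = Beta parameters (a, b) for each of the 4 propellers."""
    def __init__(self, obs_dim=9, hidden=64):
        super().__init__()
        self.features = FeatureExtractor(obs_dim, hidden)
        self.head = nn.Linear(hidden, 8)

    def distribution(self, obs):
        # softplus + 1 keeps both Beta parameters above 1 (unimodal policy)
        ab = torch.nn.functional.softplus(self.head(self.features(obs))) + 1.0
        a, b = ab[..., :4], ab[..., 4:]
        # Beta samples lie in (0, 1); they are later rescaled to the rotor-speed range.
        return torch.distributions.Beta(a, b)

    def log_prob(self, obs, actions):
        return self.distribution(obs).log_prob(actions).sum(-1)

class Critic(nn.Module):
    """State evaluation unit: a single scalar evaluation of the current state."""
    def __init__(self, obs_dim=9, hidden=64):
        super().__init__()
        self.features = FeatureExtractor(obs_dim, hidden)
        self.head = nn.Linear(hidden, 1)

    def forward(self, obs):
        return self.head(self.features(obs))
```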
Further, the state evaluation unit adopts a reward mechanism of the form:

$$r_t = r_{scalar} - k\, f\!\left(\Delta v_x,\ \Delta v_y,\ \Delta v_z\right) + r_{bonus}$$

where Δv_x, Δv_y, Δv_z are the differences between the current UAV speed and the target speed on the x, y and z coordinate axes respectively, f(Δv_x, Δv_y, Δv_z) is a penalty function constructed from the UAV speed differences, and k is the proportionality coefficient between the speed gap and the penalty amount; r_scalar is a constant used to keep the reward positive and thus ensure convergence of the algorithm; finally, the bonus term r_bonus rewards the control agent when the speed control reaches the allowable error range. Under the incentive of this reward function, the final control model converges to speed-optimal control.
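For illustration only, a reward of this shape could be computed as in the sketch below; the absolute-error penalty, the numerical constants and the per-axis tolerance are assumptions of this sketch, since the exact expression in the original filing is reproduced only as an image.

```python
import numpy as np

def speed_reward(v, v_ref, k=1.0, r_scalar=5.0, r_bonus=2.0, tol=0.1):
    """Illustrative reward of the shape described above.

    v, v_ref : current and target velocity, each an iterable [vx, vy, vz]
    k        : proportionality coefficient between the speed gap and the penalty
    r_scalar : constant offset that keeps the reward positive
    r_bonus  : bonus granted when every axis error is within the allowed error tol
    The absolute-error penalty and all numeric defaults are assumptions of this sketch.
    """
    dv = np.abs(np.asarray(v, dtype=float) - np.asarray(v_ref, dtype=float))
    reward = r_scalar - k * dv.sum()
    if np.all(dv <= tol):
        reward += r_bonus
    return float(reward)
```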
The second step, S2, of the reinforcement learning algorithm of this embodiment is the UAV position control phase, i.e. actively controlling the UAV position. At this stage the state evaluation unit in the algorithm framework is discarded, and the UAV is controlled through the speed control model of the UAV control algorithm unit; at the same time, an optimal search is performed over the PID parameters of the control algorithm, and the PID control algorithm converts the UAV speed control algorithm into a position control algorithm, eliminating overshoot in position control.
Specifically, fig. 4 is a flowchart of the unmanned aerial vehicle position control according to the embodiment of the present disclosure. As shown in fig. 4, the drone position control process includes:
First, a desired target position is provided for the UAV, and the UAV target speed is resolved by a PID method, i.e.

$$v_t^{ref} = K_p\, e_t + K_d\, \dot{e}_t$$

where K_p and K_d are the proportional and derivative term coefficients, e_t is the difference between the UAV position and the target position, and v_t^{ref} is the UAV target speed.

The mapping from the target speed v_t^{ref} to the UAV rotor speeds is obtained from the reinforcement learning model, so the control model for UAV speed control can be converted into a control model for UAV position control.

If position control were used directly as the training target of reinforcement learning, the UAV control algorithm would suffer from overshoot, which poses a considerable challenge to UAV safety; this embodiment therefore adopts the improved scheme of combining UAV speed control with a PID model. For the PID control algorithm, it is the integral term that causes UAV overshoot, so the integral term is deleted here and only the proportional and derivative terms are retained. By reducing the proportional term, the integral effect already present in the UAV speed control can be cancelled out, and effective UAV position control is thus achieved.
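A minimal sketch of one cycle of this outer PD loop is given below; the gains, the update rate and the `speed_agent` interface (standing in for the trained reinforcement learning speed controller) are illustrative assumptions, not values from the original filing.

```python
import numpy as np

def position_step(p, p_ref, prev_error, dt, speed_agent, kp=1.0, kd=0.5):
    """One cycle of the outer PD loop that turns a position goal into rotor speeds.

    The integral term is deliberately omitted (PD only) to avoid overshoot, as
    described above. speed_agent stands for the trained reinforcement learning
    speed controller: it maps a target velocity to the four propeller speeds.
    The gains kp, kd, the sign convention and the interface are illustrative.
    """
    error = np.asarray(p_ref, dtype=float) - np.asarray(p, dtype=float)  # position error e_t
    d_error = (error - prev_error) / dt                                  # derivative term
    v_target = kp * error + kd * d_error                                 # target velocity for the inner loop
    omegas = speed_agent(v_target)                                       # learned mapping to rotor speeds
    return omegas, error
```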
Fig. 5 is a schematic diagram of the three-axis response curves under the UAV control of the embodiment of the present disclosure. In Fig. 5 the abscissa is time t and the ordinate shows the response curves of the three coordinate axes. As can be seen from Fig. 5, the control method of this embodiment achieves UAV position control in about 10 s, and no overshoot occurs on any of the three coordinate axes during the position control.
In a second exemplary embodiment of the present disclosure, a zero overshoot drone position control device based on deep reinforcement learning is provided. The control device comprises a readable storage medium and one or more processors, wherein the readable storage medium is used for storing executable instructions; the one or more processors execute the control method according to the previous embodiment according to the executable instructions.
The following specifically describes the algorithm with an embodiment that applies the unmanned aerial vehicle control algorithm of the present disclosure to an unmanned aerial vehicle article placement (or delivery) scene.
Example one
In this embodiment, position control is applied to a quadrotor UAV: the article sorting and placing (or delivery) task is combined with the UAV position control algorithm, so that articles are placed at preset positions. Fig. 6 is a flowchart of the method for sorting and placing articles with a UAV according to the embodiment of the present disclosure. As shown in Fig. 6, the method for sorting and placing articles with the UAV includes:
s101, acquiring the position of an article to be sorted and placed (or delivered);
s102, controlling the position of the unmanned aerial vehicle by adopting a reinforcement learning unmanned aerial vehicle position control algorithm, so that the unmanned aerial vehicle hovers above the current position of the articles to be sorted and placed;
s103, grabbing the articles to be arranged (or delivered) by the mechanical claws;
s104, controlling the position of the unmanned aerial vehicle by adopting a reinforcement learning unmanned aerial vehicle position control algorithm, so that the unmanned aerial vehicle hovers above the target position of the articles to be sorted and placed (or delivered);
S105, releasing the mechanical claw to finish the sorting and placement (or delivery) of the article.
By controlling the rotational speeds of the four UAV propellers and converting that speed control into UAV position control in combination with the PID algorithm, the article can be accurately placed (or delivered) at the target position.
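Purely as an illustration of this flow, the mission could be scripted as in the sketch below; the `goto`, `grip` and `release` method names on the `uav` object are hypothetical and are not taken from the original filing.

```python
def place_item(uav, pick_position, place_position):
    """Illustrative scripting of the Fig. 6 flow; method names are hypothetical."""
    uav.goto(pick_position)    # S102: hover above the item using the zero-overshoot position controller
    uav.grip()                 # S103: grab the item with the mechanical claw
    uav.goto(place_position)   # S104: hover above the target position
    uav.release()              # S105: release the claw to finish placement (or delivery)
```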
So far, the embodiments of the present disclosure have been described in detail with reference to the accompanying drawings. It should be noted that implementations not shown or described in the drawings or the text are all forms known to those of ordinary skill in the art and are not described in detail. Furthermore, the above definitions of the various elements and methods are not limited to the specific structures, shapes or arrangements mentioned in the embodiments, which may be easily modified or substituted by those of ordinary skill in the art.
And the shapes and sizes of the respective components in the drawings do not reflect actual sizes and proportions, but merely illustrate the contents of the embodiments of the present disclosure. Furthermore, in the claims, any reference signs placed between parentheses shall not be construed as limiting the claim.
Furthermore, the word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements.
The use of ordinal numbers such as "first," "second," "third," etc., in the specification and claims to modify a corresponding element does not by itself connote any ordinal number of the element or any ordering of one element from another or the order of manufacture, and the use of the ordinal numbers is only used to distinguish one element having a certain name from another element having a same name.
In addition, unless steps are specifically described or must occur in sequence, the order of the steps is not limited to that listed above and may be changed or rearranged as desired by the desired design. The embodiments described above may be mixed and matched with each other or with other embodiments based on design and reliability considerations, i.e., technical features in different embodiments may be freely combined to form further embodiments.
The algorithms and displays presented herein are not inherently related to any particular computer, virtual machine, or other apparatus. Various general purpose systems may also be used with the teachings herein. The required structure for constructing such a system will be apparent from the description above. Moreover, this disclosure is not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the present disclosure as described herein, and any descriptions above of specific languages are provided for disclosure of enablement and best mode of the present disclosure.
The disclosure may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. Various component embodiments of the disclosure may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that a microprocessor or Digital Signal Processor (DSP) may be used in practice to implement some or all of the functionality of some or all of the components in the relevant apparatus according to embodiments of the present disclosure. The present disclosure may also be embodied as apparatus or device programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing the present disclosure may be stored on a computer-readable medium or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise. Also in the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the disclosure, various features of the disclosure are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various disclosed aspects. However, the disclosed method should not be interpreted as reflecting an intention that: that is, the claimed disclosure requires more features than are expressly recited in each claim. Rather, as the following claims reflect, disclosed aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this disclosure.
The above-mentioned embodiments are intended to illustrate the objects, aspects and advantages of the present disclosure in further detail, and it should be understood that the above-mentioned embodiments are only illustrative of the present disclosure and are not intended to limit the present disclosure, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present disclosure should be included in the scope of the present disclosure.

Claims (10)

1. A zero-overshoot unmanned aerial vehicle position control method based on deep reinforcement learning, comprising the following steps:
S1, constructing a reinforcement learning training framework for unmanned aerial vehicle (UAV) speed control based on the proximal policy optimization (PPO) algorithm, and training the UAV control model in combination with a feature extraction network to obtain a UAV speed control model;
S2, controlling the UAV by adding a PID control loop outside the UAV control model, performing an optimal search over the PID parameters, and using the PID control algorithm to convert the UAV speed control model into a UAV position control model, thereby eliminating overshoot in position control.
2. The zero-overshoot UAV position control method according to claim 1, wherein in step S1 the control model for UAV speed control is a Markov model whose observation state is:

$$s_t = \left[\dot{x}_t,\ \dot{y}_t,\ \dot{z}_t,\ \phi_t,\ \theta_t,\ \psi_t,\ \dot{\phi}_t,\ \dot{\theta}_t,\ \dot{\psi}_t\right]$$

wherein ẋ_t, ẏ_t, ż_t are the velocities in the x, y and z directions respectively, φ_t, θ_t, ψ_t are the UAV attitude expressed in Euler angles, and φ̇_t, θ̇_t, ψ̇_t are the corresponding Euler-angle angular velocities.
3. The method according to claim 1, wherein step S1 comprises:
alternately training the UAV control algorithm unit and the state evaluation unit of the reinforcement learning training framework, and fitting the optimal mapping from the current UAV state to the UAV control signal.
4. The zero overshoot drone position control method of claim 3 wherein,
the unmanned aerial vehicle control algorithm unit maps the current state of the unmanned aerial vehicle into a control signal, and the control signal is an unmanned aerial vehicle speed control signal;
the state evaluation unit evaluates the current state of the unmanned aerial vehicle, and the evaluation standard is the difference between the current state and the target state.
5. The zero overshoot drone position control method of claim 4 wherein,
the state evaluation unit evaluating the current state of the unmanned aerial vehicle comprises: and judging the difference between the current speed and the target control speed, wherein if the difference is larger or the time required for realizing the speed state transition is longer, the output value of the state evaluation function is smaller, otherwise, the output value is larger.
6. The zero overshoot drone position control method of claim 5 wherein said determining the difference between the current speed and the target control speed comprises:
and calculating the integral of the difference between the current speed of the unmanned aerial vehicle and the target speed of the unmanned aerial vehicle in a time range.
7. The zero-overshoot UAV position control method according to claim 4, wherein the state evaluation unit adopts a reward mechanism under whose incentive the final control model converges to speed-optimal control, and the reward function adopted by the reward mechanism is of the form:

$$r_t = r_{scalar} - k\, f\!\left(\Delta v_x,\ \Delta v_y,\ \Delta v_z\right) + r_{bonus}$$

wherein Δv_x, Δv_y, Δv_z are the differences between the current UAV speed and the target speed on the x, y and z coordinate axes respectively, f(Δv_x, Δv_y, Δv_z) is a penalty function constructed from the UAV speed differences, and k is the proportionality coefficient between the speed gap and the penalty amount; r_scalar is a constant that keeps the reward positive; and the bonus term r_bonus rewards the control agent when the speed control is within the allowable error.
8. The zero-overshoot UAV position control method according to claim 2, wherein the UAV control algorithm unit and the state evaluation unit adopt a feature extraction network structure, the feature extraction network of each unit comprising an input layer, a first fully-connected hidden layer, a second fully-connected hidden layer and an output layer; the dimensions of the input layer, the first hidden layer and the second hidden layer are the same, and the output dimension of the UAV control algorithm unit is determined by the control strategy parameters of the UAV power plant.
9. The zero-overshoot UAV position control method according to claim 1, wherein step S2 comprises:

providing a desired target position for the UAV, and resolving the UAV target speed by a PID method, i.e.

$$v_t^{ref} = K_p\, e_t + K_d\, \dot{e}_t$$

wherein K_p and K_d are the proportional and derivative term coefficients, e_t is the difference between the UAV position and the target position, and v_t^{ref} is the UAV target speed;

obtaining, through reinforcement learning, the mapping from the target speed v_t^{ref} to the speeds of the UAV power plant, so as to convert the control model for UAV speed control into a control model for UAV position control; and combining the UAV position control model with the PID algorithm model, deleting the integral term of the PID algorithm model and retaining the proportional and derivative terms, thereby achieving zero-overshoot UAV position control.
10. A zero-overshoot unmanned aerial vehicle position control device based on deep reinforcement learning, comprising:
a readable storage medium to store executable instructions;
one or more processors executing the control method of any one of claims 1-9 in accordance with the executable instructions.
CN201911363490.0A 2019-12-25 2019-12-25 Zero overshoot unmanned aerial vehicle position control method and device based on deep reinforcement learning Active CN111026147B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911363490.0A CN111026147B (en) 2019-12-25 2019-12-25 Zero overshoot unmanned aerial vehicle position control method and device based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911363490.0A CN111026147B (en) 2019-12-25 2019-12-25 Zero overshoot unmanned aerial vehicle position control method and device based on deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN111026147A true CN111026147A (en) 2020-04-17
CN111026147B CN111026147B (en) 2021-01-08

Family

ID=70213642

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911363490.0A Active CN111026147B (en) 2019-12-25 2019-12-25 Zero overshoot unmanned aerial vehicle position control method and device based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN111026147B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111786713A (en) * 2020-06-04 2020-10-16 大连理工大学 Unmanned aerial vehicle network hovering position optimization method based on multi-agent deep reinforcement learning
CN113268081A (en) * 2021-05-31 2021-08-17 中国人民解放军32802部队 Small unmanned aerial vehicle prevention and control command decision method and system based on reinforcement learning
CN113359703A (en) * 2021-05-13 2021-09-07 浙江工业大学 Mobile robot line-following system suitable for various complex paths
CN113359704A (en) * 2021-05-13 2021-09-07 浙江工业大学 Self-adaptive SAC-PID method suitable for complex unknown environment
CN114237267A (en) * 2021-11-02 2022-03-25 中国人民解放军海军航空大学航空作战勤务学院 Flight maneuver decision auxiliary method based on reinforcement learning
CN114428517A (en) * 2022-01-26 2022-05-03 海南大学 Unmanned aerial vehicle unmanned ship cooperation platform end-to-end autonomous landing control method

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6751529B1 (en) * 2002-06-03 2004-06-15 Neural Robotics, Inc. System and method for controlling model aircraft
US9189730B1 (en) * 2012-09-20 2015-11-17 Brain Corporation Modulated stochasticity spiking neuron network controller apparatus and methods
US20170118688A1 (en) * 2015-10-23 2017-04-27 The Florida International University Board Of Trustees Interference and mobility management in uav-assisted wireless networks
CN106843245A (en) * 2016-12-01 2017-06-13 北京京东尚科信息技术有限公司 A kind of UAV Attitude control method, device and unmanned plane
CN107943022A (en) * 2017-10-23 2018-04-20 清华大学 A kind of PID locomotive automatic Pilot optimal control methods based on intensified learning
EP3422130A1 (en) * 2017-06-29 2019-01-02 The Boeing Company Method and system for autonomously operating an aircraft
CN109739090A (en) * 2019-01-15 2019-05-10 哈尔滨工程大学 A kind of autonomous type underwater robot neural network intensified learning control method
CN109992000A (en) * 2019-04-04 2019-07-09 北京航空航天大学 A kind of multiple no-manned plane path collaborative planning method and device based on Hierarchical reinforcement learning
CN110083168A (en) * 2019-05-05 2019-08-02 天津大学 Small-sized depopulated helicopter based on enhancing study determines high control method
CN110275432A (en) * 2019-05-09 2019-09-24 中国电子科技集团公司电子科学研究院 Unmanned plane based on intensified learning hangs load control system
KR102032067B1 (en) * 2018-12-05 2019-10-14 세종대학교산학협력단 Remote control device and method of uav based on reforcement learning
CN110488861A (en) * 2019-07-30 2019-11-22 北京邮电大学 Unmanned plane track optimizing method, device and unmanned plane based on deeply study

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6751529B1 (en) * 2002-06-03 2004-06-15 Neural Robotics, Inc. System and method for controlling model aircraft
US9189730B1 (en) * 2012-09-20 2015-11-17 Brain Corporation Modulated stochasticity spiking neuron network controller apparatus and methods
US20170118688A1 (en) * 2015-10-23 2017-04-27 The Florida International University Board Of Trustees Interference and mobility management in uav-assisted wireless networks
CN106843245A (en) * 2016-12-01 2017-06-13 北京京东尚科信息技术有限公司 A kind of UAV Attitude control method, device and unmanned plane
JP2019059461A (en) * 2017-06-29 2019-04-18 ザ・ボーイング・カンパニーThe Boeing Company Method and system for autonomously operating aircraft
EP3422130A1 (en) * 2017-06-29 2019-01-02 The Boeing Company Method and system for autonomously operating an aircraft
CN107943022A (en) * 2017-10-23 2018-04-20 清华大学 A kind of PID locomotive automatic Pilot optimal control methods based on intensified learning
KR102032067B1 (en) * 2018-12-05 2019-10-14 세종대학교산학협력단 Remote control device and method of uav based on reforcement learning
CN109739090A (en) * 2019-01-15 2019-05-10 哈尔滨工程大学 A kind of autonomous type underwater robot neural network intensified learning control method
CN109992000A (en) * 2019-04-04 2019-07-09 北京航空航天大学 A kind of multiple no-manned plane path collaborative planning method and device based on Hierarchical reinforcement learning
CN110083168A (en) * 2019-05-05 2019-08-02 天津大学 Small-sized depopulated helicopter based on enhancing study determines high control method
CN110275432A (en) * 2019-05-09 2019-09-24 中国电子科技集团公司电子科学研究院 Unmanned plane based on intensified learning hangs load control system
CN110488861A (en) * 2019-07-30 2019-11-22 北京邮电大学 Unmanned plane track optimizing method, device and unmanned plane based on deeply study

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
ERLEND M. COATES et al.: "Deep Reinforcement Learning Attitude Control of Fixed-Wing UAVs Using Proximal Policy Optimization", 2019 International Conference on Unmanned Aircraft Systems (ICUAS) *
WILLIAM KOCH et al.: "Reinforcement Learning for UAV Attitude Control", ACM Transactions on Cyber-Physical Systems *
LI TING: "Research on Control of a UAV Suspended-Load System Based on Reinforcement Learning", China Master's Theses Full-text Database, Engineering Science and Technology II *
HU TIANQI et al.: "Research and Verification of an Intelligent Flight Control System for a Quadrotor Aircraft", Science and Technology Innovation Herald *
CAI WENLAN et al.: "Design of an Attitude Controller for an Unmanned Helicopter Based on Reinforcement Learning", Journal of Projectiles, Rockets, Missiles and Guidance *
HAO CHUANCHUAN et al.: "Output-Feedback Reinforcement Learning Control Based on a Reference Model", Journal of Zhejiang University (Engineering Science) *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111786713A (en) * 2020-06-04 2020-10-16 大连理工大学 Unmanned aerial vehicle network hovering position optimization method based on multi-agent deep reinforcement learning
CN113359703A (en) * 2021-05-13 2021-09-07 浙江工业大学 Mobile robot line-following system suitable for various complex paths
CN113359704A (en) * 2021-05-13 2021-09-07 浙江工业大学 Self-adaptive SAC-PID method suitable for complex unknown environment
CN113268081A (en) * 2021-05-31 2021-08-17 中国人民解放军32802部队 Small unmanned aerial vehicle prevention and control command decision method and system based on reinforcement learning
CN113268081B (en) * 2021-05-31 2021-11-09 中国人民解放军32802部队 Small unmanned aerial vehicle prevention and control command decision method and system based on reinforcement learning
CN114237267A (en) * 2021-11-02 2022-03-25 中国人民解放军海军航空大学航空作战勤务学院 Flight maneuver decision auxiliary method based on reinforcement learning
CN114237267B (en) * 2021-11-02 2023-11-24 中国人民解放军海军航空大学航空作战勤务学院 Flight maneuver decision assisting method based on reinforcement learning
CN114428517A (en) * 2022-01-26 2022-05-03 海南大学 Unmanned aerial vehicle unmanned ship cooperation platform end-to-end autonomous landing control method

Also Published As

Publication number Publication date
CN111026147B (en) 2021-01-08

Similar Documents

Publication Publication Date Title
CN111026147B (en) Zero overshoot unmanned aerial vehicle position control method and device based on deep reinforcement learning
Shaobo et al. A collision avoidance decision-making system for autonomous ship based on modified velocity obstacle method
Moon et al. Challenges and implemented technologies used in autonomous drone racing
Floreano et al. Science, technology and the future of small autonomous drones
Naidoo et al. Quad-Rotor unmanned aerial vehicle helicopter modelling & control
US20180164124A1 (en) Robust and stable autonomous vision-inertial navigation system for unmanned vehicles
Qi et al. Autonomous landing solution of low-cost quadrotor on a moving platform
Zhao et al. Route planning for autonomous vessels based on improved artificial fish swarm algorithm
DE102020120357A1 (en) SYSTEM AND PROCEDURE FOR SIMULATIONS OF VEHICLE-BASED ITEM DELIVERY
Levin et al. Agile maneuvering with a small fixed-wing unmanned aerial vehicle
Xie et al. Application of improved Cuckoo search algorithm to path planning unmanned aerial vehicle
Zhou et al. Design and implementation of a novel obstacle avoidance scheme based on combination of CNN-based deep learning method and liDAR-based image processing approach
CN106292297A (en) Based on PID controller and the attitude control method of L1 adaptive controller
Kownacki et al. Precision landing tests of tethered multicopter and VTOL UAV on moving landing pad on a lake
Zhang et al. Bio-inspired vision based robot control using featureless estimations of time-to-contact
Roberge et al. Fast path planning for unmanned aerial vehicle using embedded GPU System
Trunov Transformation of operations with fuzzy sets for solving the problems on optimal motion of crewless unmanned vehicles
Serhat Development stages of a semi-autonomous underwater vehicle experiment platform
Kang et al. Autonomous waypoint guidance for tilt-rotor unmanned aerial vehicle that has nacelle-fixed auxiliary wings
Zufferey et al. Optic flow to steer and avoid collisions in 3D
Doyle et al. The vulnerability of UAVs: An adversarial machine learning perspective
Leung et al. A second generation micro-vehicle testbed for cooperative control and sensing strategies
Wang et al. Architecture design and flight control of a novel octopus shaped multirotor vehicle
Marcu Fuzzy logic approach in real-time UAV control
CN107632166A (en) A kind of historical wind speed based on unmanned plane big data obtains system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant