Disclosure of Invention
In order to overcome the defects in the prior art, the present invention aims to provide a multidimensional optimization method and system for an unmanned aerial vehicle communication system based on reinforcement learning.
In order to achieve the above and other objects, the present invention provides a multidimensional optimization method for an unmanned aerial vehicle communication system based on reinforcement learning, comprising the following steps:
step S1, establishing a flight path and power distribution optimization problem model of the unmanned aerial vehicle communication system under the constraint of the minimum transmission rate;
step S2, fixing the flight trajectory, reformulating the power allocation strategy sub-problem of the flight trajectory and power allocation optimization problem model established for the unmanned aerial vehicle communication system under the minimum transmission rate constraint, and solving it by a convex optimization method to obtain the power allocation factors;
and step S3, optimizing the flight trajectory by an iterative reinforcement learning method to obtain the optimal flight trajectory.
Preferably, in step S1, the flight trajectory and power allocation optimization problem model under the constraint of the minimum transmission rate is represented as:
wherein r_k[n] denotes the secrecy capacity of the kth user, ξ_i[n] denotes the power allocation factor of the ith user, w[n] denotes the horizontal position of the unmanned aerial vehicle, v_m is the maximum moving speed of the unmanned aerial vehicle, N denotes the number of time slots into which a given observation time T is divided, the interval between two adjacent time slots is δ = T/N, and R_k,k[n] denotes the capacity of the kth user.
Preferably, the step S2 further includes:
s200, assuming that the flight trajectory is fixed, reformulating the power allocation strategy optimization problem of the flight trajectory and power allocation optimization problem model under the minimum transmission rate constraint as a convex optimization problem in three variables;
step S201, solving the power allocation strategy optimization problem obtained in step S200 by an iterative approximate convex optimization method to obtain the power allocation factors.
Preferably, in step S200, the objective function r_k[n] is converted into a convex function, so that the power allocation strategy optimization problem of the flight trajectory and power allocation optimization problem model under the minimum transmission rate constraint is reformulated as a convex optimization problem in three variables.
Preferably, in step S200, a first-order Taylor expansion is adopted and a relaxation variable is introduced to convert the objective function into a convex function, so that the power allocation strategy optimization problem of the flight trajectory and power allocation optimization problem model under the minimum transmission rate constraint is reformulated as a convex optimization problem P1 in three variables.
Preferably, in step S201, the solving process of the P1 problem includes:
step 1, obtaining an initial power allocation factor according to the minimum transmission rate requirement, and allocating all remaining power to the strongest user; initializing the iteration index r = 0, and calculating ξ_r[n], I_r[n], I_e,r[n], η_r[n];
step 2, given η_r[n], solving the P1 problem with a standard convex optimization solving tool to obtain the updated ξ_(r+1)[n], I_(r+1)[n], I_e,(r+1)[n], η_(r+1)[n], and updating the iteration index r = r + 1;
step 3, if r reaches the maximum number of iterations or the increment of the objective function of the P1 problem is less than a predetermined threshold, the iteration stops; otherwise, step 2 is repeated.
Preferably, the step S3 further includes:
step S300, performing grid partitioning on the horizontal target space of the flight trajectory of the unmanned aerial vehicle with a partition granularity of v_m·δ × v_m·δ, converting the different grid cells into a state space for reinforcement learning according to their coordinates, and approximating the continuous action space of the unmanned aerial vehicle as a discrete action space consisting of five optional actions;
step S301, defining the sum of the secrecy capacities after the position of the unmanned aerial vehicle is updated as the reward function, and performing iterative updating of the value function;
step S302, after each round of reinforcement learning, the unmanned aerial vehicle obtains a new updated position; the updated power allocation factor is calculated with the P1 solving method of step S2, and the value function is iteratively updated;
and S303, after the unmanned aerial vehicle has explored for a number of rounds, the value function gradually approaches the optimal value function, and the optimal flight trajectory of the unmanned aerial vehicle is finally obtained.
Preferably, in step S301, the value function is iteratively updated according to the following iterative formula:
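The iterative formula referenced here did not survive extraction. Using the symbols defined just below (value function Q_n, reward R_n, learning rate factor θ, discount factor β), the standard Q-learning update it presumably corresponds to is the following — a reconstruction of the textbook form, not the patent's verbatim expression:

```latex
Q_{n+1}(s_n, a_n) = (1 - \theta)\, Q_n(s_n, a_n)
  + \theta \left[ R_n + \beta \max_{a'} Q_n(s_{n+1}, a') \right]
```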
wherein Q_n(s_n, a_n) is the value function, with an all-zero initial value, R_n is the reward function, θ is the learning rate factor, and β is the discount factor.
In order to achieve the above object, the present invention further provides a multidimensional optimization system of an unmanned aerial vehicle communication system based on reinforcement learning, including:
the model building unit is used for building a flight trajectory and power distribution optimization problem model of the unmanned aerial vehicle communication system under the constraint of the minimum transmission rate;
the convex optimization solving unit is used for fixing the flight trajectory, reformulating the power allocation strategy sub-problem of the flight trajectory and power allocation optimization problem model established for the unmanned aerial vehicle communication system under the minimum transmission rate constraint, and solving it by a convex optimization method to obtain the power allocation factors;
and the reinforcement learning optimization unit is used for optimizing the flight trajectory by an iterative reinforcement learning method to obtain the optimal flight trajectory.
Preferably, the reinforcement learning optimization unit is specifically configured to:
firstly, grid partitioning is performed on the horizontal target space of the flight trajectory of the unmanned aerial vehicle (UAV) with a partition granularity of v_m·δ × v_m·δ; the different grid cells are converted into a state space for reinforcement learning according to their coordinates, and the continuous action space of the UAV is approximated as a discrete action space consisting of five optional actions;
the sum of the secrecy capacities after the position of the UAV is updated is defined as the reward function, so as to perform iterative updating of the value function;
after each round of reinforcement learning, the UAV obtains a new updated position; the updated power allocation factor is calculated with the P1 solving method of the convex optimization solving unit, and the value function is iteratively updated;
after the UAV has explored for a number of rounds, the value function gradually approaches the optimal value function, and the optimal UAV flight trajectory is finally obtained.
Compared with the prior art, the reinforcement-learning-based multidimensional optimization method and system for an unmanned aerial vehicle communication system disclosed by the invention jointly optimize the flight trajectory, the power allocation factors and other dimensions by an optimization method combining convex optimization and reinforcement learning. Because the reward function can be obtained through feedback in reinforcement learning, the communication system does not need to be accurately modeled, the application scenarios are wider, and the optimal multidimensional optimization result can be obtained.
Detailed Description
Other advantages and capabilities of the present invention will be readily apparent to those skilled in the art from the present disclosure by describing the embodiments of the present invention with specific embodiments thereof in conjunction with the accompanying drawings. The invention is capable of other and different embodiments and its several details are capable of modification in various other respects, all without departing from the spirit and scope of the present invention.
Fig. 1 is a flowchart illustrating steps of a multidimensional optimization method for an unmanned aerial vehicle communication system based on reinforcement learning according to the present invention. The invention relates to a multidimensional optimization method of an unmanned aerial vehicle communication system based on reinforcement learning, which comprises the following steps:
and step S1, establishing a flight trajectory and power distribution optimization problem model of the unmanned aerial vehicle communication system under the constraint of the minimum transmission rate.
An unmanned aerial vehicle (UAV) communication system model to which the present invention applies is shown in fig. 2, in which there are a UAV communication base station, K target users, and an eavesdropping user. The UAV base station moves freely within a horizontal target area at height H. The location of the ith target user is expressed as L_i = [x_i, y_i]^T, i ∈ [1, K], and the location of the eavesdropping user is expressed as L_e. The flight trajectory of the UAV base station at different time points can be expressed as:
W = {w[n] = [x[n], y[n]]^T, n = 1, 2, ..., N} (formula one)
wherein w[n] represents the horizontal coordinates at the nth observation time point, N represents the number of time slots into which a given observation time T is divided, the interval between two adjacent slots can be expressed as δ = T/N, and the maximum movement rate of the UAV is v_m. At this time, the channel fading power from the ith user to the UAV base station can be expressed as:
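The channel-fading expression itself is missing from the text. Given the definitions that follow (reference gain ρ_o at unit distance, path loss exponent α, UAV altitude H), the conventional distance-based line-of-sight fading model it presumably takes is:

```latex
h_i[n] = \rho_o\, d_i^{-\alpha}, \qquad
d_i[n] = \sqrt{H^2 + \left\lVert w[n] - L_i \right\rVert^2}
```

This is a reconstruction of the standard UAV channel model under those assumptions, not the patent's verbatim formula.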
wherein d_i represents the distance of the ith user from the UAV, ρ_o denotes the reference signal power gain at unit distance, and α ≥ 2 denotes the channel path loss exponent, whose value typically ranges between 2 and 4. Similarly, the channel fading power from the eavesdropping user to the UAV base station is:
wherein d_e denotes the distance of the eavesdropping user from the UAV.
Assuming that the UAV communication system employs a non-orthogonal multiple access (NOMA) communication protocol, its downlink transmission signal can be expressed as:
wherein P represents the total transmit power of the UAV base station, x_i is the data symbol of the ith user, and ξ_i represents the power allocation factor of the ith user.
In view of the power constraint, we have:
wherein Ω represents the set of all users, Ω = [1, K].
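The downlink-signal and power-constraint expressions do not survive in the text. For a NOMA downlink with total power P, data symbols x_i and power allocation factors ξ_i as defined above, the standard forms (a reconstruction under that assumption, not the verbatim formulas) are:

```latex
x[n] = \sum_{i \in \Omega} \sqrt{\xi_i[n]\, P}\; x_i[n],
\qquad \sum_{i \in \Omega} \xi_i[n] \le 1
```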
Then the received signal of the ith user is:
wherein n_i represents the received noise of the target user, with power σ², and g_i[n] is the channel fading power from the ith user to the UAV base station. According to the successive interference cancellation algorithm of the NOMA receiver, the signal-to-noise ratio of the kth data stream symbol at the ith user can be expressed as:
In the above formula, I[n] represents the interference part of the kth data stream, and ξ_i represents the power allocation factor of the ith user. At this time, the capacity of the kth user can be expressed as:
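The SINR and capacity expressions are missing here. Under NOMA with successive interference cancellation, denoting the kth user's channel fading power by h_k[n] and the residual interference of the kth data stream by I[n], the textbook forms they presumably follow are:

```latex
\gamma_k[n] = \frac{\xi_k[n]\, h_k[n]\, P}{I[n]\, h_k[n]\, P + \sigma^2},
\qquad
R_{k,k}[n] = \log_2\!\left(1 + \gamma_k[n]\right)
```

Here I[n] would sum the allocation factors of the streams not yet cancelled at the kth user; this is a hedged reconstruction of the standard form, not the patent's exact expression.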
It is assumed that each user has a minimum transmission rate constraint, i.e.:
wherein the threshold on the right-hand side indicates the minimum transmission rate requirement of the kth user.
Similarly, for an eavesdropping user, the capacity of the kth data stream can be expressed as:
wherein:
According to the definition of secrecy capacity, the secrecy capacity of the kth user is given as follows:
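The secrecy-capacity formula is standard in physical-layer security; with R_k,k[n] the kth user's capacity and R_e,k[n] the eavesdropper's capacity on the kth data stream as defined above, it reads:

```latex
r_k[n] = \left[\, R_{k,k}[n] - R_{e,k}[n] \,\right]^{+}
       = \max\!\left( R_{k,k}[n] - R_{e,k}[n],\ 0 \right)
```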
in summary, the flight trajectory and power allocation optimization problem model under the constraint of the minimum transmission rate can be expressed as:
wherein W represents the optimal flight trajectory, and ξ represents the optimal power allocation factor.
And S2, fixing the flight trajectory, reformulating the power allocation strategy sub-problem of the flight trajectory and power allocation optimization problem model established for the unmanned aerial vehicle communication system under the minimum transmission rate constraint, and solving it by a convex optimization method to obtain the power allocation factors.
Because the flight trajectory and power allocation optimization problem model (formula nine) established in step S1 under the minimum transmission rate constraint is non-convex and difficult to solve directly, the present invention solves this optimization problem with a method combining approximate convex optimization and reinforcement learning; that is, the flight trajectory, the power allocation factors and other dimensions are jointly optimized by an optimization method combining convex optimization and reinforcement learning.
Specifically, step S2 further includes:
And S200, assuming that the flight trajectory is fixed, the power allocation strategy optimization problem of the flight trajectory and power allocation optimization problem model under the minimum transmission rate constraint is reformulated as a convex optimization problem in three variables.
Firstly, assuming that the flight trajectory is fixed, the power distribution strategy optimization problem is solved.
Considering the minimum transmission rate equation six may translate to:
the objective function for the P0 problem can be approximated as:
wherein:
At this time, the objective function is still not a convex function; therefore, the present invention employs a first-order Taylor expansion and introduces a relaxation variable. The objective function is further converted into:
wherein:
the result of the solution of the r-th time is obtained.
At this time, the power allocation strategy optimization problem can be reformulated as:
thus, the P1 problem is a convex optimization problem with respect to three variables.
Step S201, solving the power allocation strategy optimization problem obtained in step S200 by an iterative approximate convex optimization method to obtain the power allocation factors.
In a specific embodiment of the invention, a standard convex optimization solver, such as CVX, is used to perform the numerical solution. Specifically, the flow of the solving method of the P1 problem is as follows:
step 1, initialization: obtain an initial power allocation factor according to the minimum transmission rate requirement, and allocate all remaining power to the strongest user; initialize the iteration index r = 0 and calculate ξ_r[n], I_r[n], I_e,r[n], η_r[n];
step 2, given η_r[n], solve the P1 problem with the CVX tool to obtain the updated ξ_(r+1)[n], I_(r+1)[n], I_e,(r+1)[n], η_(r+1)[n], and update the iteration index r = r + 1;
step 3, if r reaches the maximum number of iterations or the increment of the objective function of the P1 problem is less than a preset threshold, the iteration stops; otherwise, step 2 is repeated.
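The step 1 to step 3 loop above can be sketched in code. The sketch below is illustrative only: the inner CVX solve of the P1 problem is replaced by a simple power-transfer improvement step on a sum-rate surrogate objective (the exact P1 expressions are not recoverable from the text), and the channel gains, minimum rate r_min, and stopping threshold are assumed placeholder values.

```python
import math

def sum_rate(xi, gains):
    """Surrogate objective: sum of log2(1 + xi_i * g_i) over users."""
    return sum(math.log2(1.0 + x * g) for x, g in zip(xi, gains))

def solve_power_allocation(gains, r_min=0.1, max_iter=10000,
                           step=1e-3, eps=1e-9):
    K = len(gains)
    # Step 1: initial allocation from the minimum-rate requirement:
    # log2(1 + xi * g) >= r_min  =>  xi >= (2**r_min - 1) / g,
    # then all remaining power goes to the strongest user.
    floor = [(2.0 ** r_min - 1.0) / g for g in gains]
    if sum(floor) > 1.0:
        raise ValueError("minimum-rate constraints are infeasible")
    xi = floor[:]
    xi[max(range(K), key=lambda i: gains[i])] += 1.0 - sum(floor)
    prev = sum_rate(xi, gains)
    # Steps 2-3: iterate an inner improvement step (stand-in for the
    # CVX solve) until the objective increment drops below a threshold
    # or the maximum number of iterations is reached.
    for _ in range(max_iter):
        # marginal rate: d/dxi log2(1 + xi*g) = g / ((1 + xi*g) ln 2)
        deriv = [g / ((1.0 + x * g) * math.log(2.0))
                 for x, g in zip(xi, gains)]
        donors = [i for i in range(K) if xi[i] > floor[i] + step]
        if not donors:
            break
        src = min(donors, key=lambda i: deriv[i])
        dst = max(range(K), key=lambda i: deriv[i])
        if src == dst:
            break
        xi[src] -= step
        xi[dst] += step
        cur = sum_rate(xi, gains)
        if cur - prev < eps:   # increment below threshold: stop
            xi[src] += step    # undo the non-improving move
            xi[dst] -= step
            break
        prev = cur
    return xi
```

With gains such as [1.0, 2.0, 4.0], the loop redistributes the residual power toward a water-filling-like allocation while keeping every user above its rate floor, mirroring the initialize-then-iterate structure of steps 1 to 3.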
And step S3, optimizing the flight trajectory by an iterative reinforcement learning method to obtain the optimal flight trajectory.
After the power distribution factor is obtained, the flight trajectory needs to be optimized continuously. The invention adopts an optimization method based on reinforcement learning to solve. Specifically, step S3 further includes:
Step S300, grid partitioning is performed on the horizontal target space of the UAV flight trajectory with a partition granularity of v_m·δ × v_m·δ, and the different grid cells are converted, according to their coordinates, into a state space s_n for reinforcement learning.
Step S301, the continuous action space of the UAV is approximated as a discrete action space a_n consisting of five optional actions.
Step S302, the sum of the secrecy capacities after the position update of the UAV is defined as the reward function, and the value function is iteratively updated with the following iterative formula:
wherein Q_n(s_n, a_n) is the value function, with an all-zero initial value, R_n is the reward function, θ is the learning rate factor, and β is the discount factor; s_n represents the state at time point n, i.e., the horizontal coordinate, and a_n represents the action taken by the UAV at time point n.
Step S303, after each round of reinforcement learning, the UAV obtains a new updated position. The reinforcement learning update adopts a probabilistic greedy algorithm: the optimal action under the current value function is selected with a certain probability, and the remaining probability is evenly distributed over all other non-optimal actions. The updated power allocation factor is calculated with the P1 solving method of step S2, the value function is updated with the above iterative formula, and whether the value function approaches the optimal value function is judged; if not, a new updated position is obtained, the updated power allocation factor is calculated, and the iteration continues until, as described in step S304, the value function approaches the optimal value function.
And S304, after the UAV has explored for a number of rounds, the value function gradually approaches the optimal value function, and the optimal UAV flight trajectory is finally obtained.
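Steps S300 to S304 can be sketched as a small tabular Q-learning loop. Everything numeric below is a placeholder assumption: the grid size, the hyperparameters, and above all the reward, which stands in for the secrecy-capacity sum that the actual method recomputes via the P1 solve at every position update.

```python
import random

GRID = 10                                  # cells per side (v_m * delta granularity)
ACTIONS = [(0, 1), (0, -1), (1, 0), (-1, 0), (0, 0)]  # five optional actions
THETA, BETA, EPS = 0.5, 0.9, 0.1           # learning rate, discount, exploration

def reward(state):
    # Placeholder for the secrecy-capacity sum after the position update:
    # here simply closeness to an assumed best hovering cell (7, 3).
    return -abs(state[0] - 7) - abs(state[1] - 3)

def step(state, action):
    # Clip to the grid so the UAV never leaves the target area.
    x = min(GRID - 1, max(0, state[0] + action[0]))
    y = min(GRID - 1, max(0, state[1] + action[1]))
    return (x, y)

def greedy(Q, s):
    # Best action under the current value function (unseen entries are 0).
    return max(range(len(ACTIONS)), key=lambda a: Q.get((s, a), 0.0))

def train(episodes=500, horizon=40, seed=0):
    rng = random.Random(seed)
    Q = {}                                 # value function, all-zero initial value
    for _ in range(episodes):
        s = (0, 0)
        for _ in range(horizon):
            # Probabilistic greedy: explore with probability EPS,
            # otherwise take the best action under the current values.
            a = rng.randrange(len(ACTIONS)) if rng.random() < EPS else greedy(Q, s)
            s2 = step(s, ACTIONS[a])
            best_next = max(Q.get((s2, i), 0.0) for i in range(len(ACTIONS)))
            # Value-function update with learning rate THETA, discount BETA.
            Q[(s, a)] = ((1 - THETA) * Q.get((s, a), 0.0)
                         + THETA * (reward(s2) + BETA * best_next))
            s = s2
    return Q
```

After training, following the greedy policy from the starting cell traces the learned flight trajectory toward the high-reward region, corresponding to the value function approaching its optimum after a number of exploration rounds.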
Fig. 3 is a system architecture diagram of a multidimensional optimization system of an unmanned aerial vehicle communication system based on reinforcement learning according to the present invention. The invention relates to a multidimensional optimization system of an unmanned aerial vehicle communication system based on reinforcement learning, which comprises the following components:
the model building unit 301 is configured to build a flight trajectory and power distribution optimization problem model of the unmanned aerial vehicle communication system under the constraint of the minimum transmission rate.
The unmanned aerial vehicle (UAV) communication system model to which the present invention applies is shown in fig. 2, in which there are a UAV communication base station, K target users, and an eavesdropping user. The UAV base station moves freely within a horizontal target area at height H. The location of the ith target user is expressed as L_i = [x_i, y_i]^T, i ∈ [1, K], and the location of the eavesdropping user is expressed as L_e. The flight trajectory of the UAV base station at different time points can be expressed as:
W = {w[n] = [x[n], y[n]]^T, n = 1, 2, ..., N}
wherein N represents the number of time slots into which a given observation time T is divided, the interval between two adjacent slots can be expressed as δ = T/N, and the maximum movement rate of the UAV is v_m. At this time, the channel fading power from the ith user to the UAV base station can be expressed as:
wherein ρ_o is the reference signal power gain at unit distance, and α ≥ 2 represents the channel path loss exponent. Similarly, the channel fading power from the eavesdropping user to the UAV base station is:
assuming that the UAV communication system employs a non-orthogonal multiple access (NOMA) communication protocol, its downlink transmission signal can be expressed as:
wherein P represents the total transmit power of the UAV base station, x_i is the data symbol of the ith user, and ξ_i represents the power allocation factor of the ith user.
In view of the power constraint, we have:
Then the received signal of the ith user is:
wherein n_i represents the received noise of the target user. According to the successive interference cancellation algorithm of the NOMA receiver, the signal-to-noise ratio of the kth data stream symbol at the ith user can be expressed as:
In the above formula, I[n] represents the interference part of the kth data stream. At this time, the capacity of the kth user can be expressed as:
It is assumed that each user has a minimum transmission rate constraint, i.e.:
similarly, for an eavesdropping user, the capacity of the kth data stream can be expressed as:
wherein:
According to the definition of secrecy capacity, the secrecy capacity of the kth user is given as follows:
in summary, the flight trajectory and power allocation optimization problem model under the constraint of the minimum transmission rate can be expressed as:
And the convex optimization solving unit 302 is configured to fix the flight trajectory, reformulate the power allocation strategy sub-problem of the flight trajectory and power allocation optimization problem model established for the unmanned aerial vehicle communication system under the minimum transmission rate constraint, and solve it with a convex optimization method to obtain the power allocation factors.
Because the flight trajectory and power allocation optimization problem model (formula nine) established by the model establishing unit 301 under the minimum transmission rate constraint is non-convex and difficult to solve directly, the present invention solves this optimization problem with a method combining approximate convex optimization and reinforcement learning; that is, the flight trajectory, the power allocation factors and other dimensions are jointly optimized by an optimization method combining convex optimization and reinforcement learning.
Specifically, the convex optimization solving unit 302 further includes:
And the model conversion module is used for assuming that the flight trajectory is fixed and reformulating the power allocation strategy optimization problem of the flight trajectory and power allocation optimization problem model under the minimum transmission rate constraint as a convex optimization problem in three variables.
Firstly, assuming that the flight trajectory is fixed, the power distribution strategy optimization problem is solved.
Considering the minimum transmission rate constraint, this can be converted to:
the objective function for the P0 problem can be approximated as:
wherein:
At this time, the objective function is still not a convex function; therefore, the invention adopts a first-order Taylor expansion, introduces a relaxation variable, and further converts the objective function into:
wherein:
denotes the solution result of the rth iteration.
At this time, the power allocation policy optimization problem can be collated as:
thus, the P1 problem is a convex optimization problem with respect to three variables.
And the convex optimization solving module is used for solving the power distribution strategy optimization problem converted by the model conversion module by adopting an iterative approximate convex optimization method to obtain a power distribution factor.
In a specific embodiment of the invention, a standard convex optimization solver, such as CVX, is used to perform the numerical solution. Specifically, the flow of the solving method of the P1 problem is as follows:
step 1, initialization: obtain an initial power allocation factor according to the minimum transmission rate requirement, and allocate all remaining power to the strongest user; initialize the iteration index r = 0 and calculate ξ_r[n], I_r[n], I_e,r[n], η_r[n];
step 2, given η_r[n], solve the P1 problem with the CVX tool to obtain the updated ξ_(r+1)[n], I_(r+1)[n], I_e,(r+1)[n], η_(r+1)[n], and update the iteration index r = r + 1;
step 3, if r reaches the maximum number of iterations or the increment of the objective function of the P1 problem is less than a preset threshold, the iteration stops; otherwise, step 2 is repeated.
And the reinforcement learning optimization unit 303 is configured to optimize the flight trajectory by an iterative reinforcement learning method to obtain the optimal flight trajectory.
After the power distribution factor is obtained, the flight trajectory needs to be optimized continuously. The reinforcement learning optimization unit 303 of the present invention performs solution by using an optimization method based on reinforcement learning. The reinforcement learning optimization unit 303 is specifically configured to:
Firstly, grid partitioning is performed on the horizontal target space of the flight trajectory of the unmanned aerial vehicle (UAV) with a partition granularity of v_m·δ × v_m·δ; the different grid cells are converted into a state space for reinforcement learning according to their coordinates, and the continuous action space of the UAV is approximated as a discrete action space consisting of five optional actions: front, back, left, right, and hovering in place.
The sum of the secrecy capacities after the UAV position update is defined as the reward function, and the value function is iteratively updated with the following iterative formula:
wherein Q_n(s_n, a_n) is the value function, with an all-zero initial value, R_n is the reward function, θ is the learning rate factor, and β is the discount factor.
After each round of reinforcement learning, the UAV obtains a new updated position. The reinforcement learning update adopts a probabilistic greedy algorithm: the optimal action under the current value function is selected with a certain probability, and the remaining probability is evenly distributed over all other non-optimal actions; the updated power allocation factor is calculated with the P1 solving method of the convex optimization solving unit 302. The value function is updated with the above iterative formula, and whether it approaches the optimal value function is judged; if not, a new updated position is obtained, the updated power allocation factor is calculated, and the iteration continues until the value function approaches the optimal value function.
After the UAV has explored for a number of rounds, the value function gradually approaches the optimal value function, and the optimal UAV flight trajectory is finally obtained.
Examples
In this embodiment, it is assumed that 9 target users are distributed along a 45-degree diagonal line and that the eavesdropping user is located at the coordinates (100 ); the resulting flight trajectory is shown in fig. 4 and the corresponding secrecy capacity in fig. 5. As can be seen from fig. 4 and 5, the reinforcement-learning-based multidimensional optimization method reaches a substantially steady state after about 2000 rounds of exploration and maintains the optimal capacity sum with high probability.
In summary, the present invention provides a reinforcement-learning-based multidimensional optimization method and system for an unmanned aerial vehicle communication system, which jointly optimizes multiple dimensions such as the flight trajectory and the power allocation factors by an optimization method combining convex optimization and reinforcement learning.
The foregoing embodiments are merely illustrative of the principles and utilities of the present invention and are not intended to limit the invention. Modifications and variations can be made to the above-described embodiments by those skilled in the art without departing from the spirit and scope of the present invention. Therefore, the scope of the invention should be determined from the following claims.