Disclosure of Invention
In order to overcome the defects in the prior art, the present invention aims to provide a multidimensional optimization method and system for an unmanned aerial vehicle communication system based on reinforcement learning.
In order to achieve the above and other objects, the present invention provides a multidimensional optimization method for an unmanned aerial vehicle communication system based on reinforcement learning, comprising the following steps:
step S1, establishing a flight path and power distribution optimization problem model of the unmanned aerial vehicle communication system under the constraint of the minimum transmission rate;
step S2, fixing the flight trajectory, reformulating the power allocation strategy sub-problem of the flight trajectory and power allocation optimization problem model established for the unmanned aerial vehicle communication system under the minimum transmission rate constraint, and solving it by a convex optimization method to obtain the power allocation factors;
and step S3, optimizing the flight trajectory by an iterative reinforcement learning method to obtain the optimal flight trajectory.
Preferably, in step S1, the flight trajectory and power allocation optimization problem model under the constraint of the minimum transmission rate is represented as:
wherein r_k[n] denotes the secrecy capacity of the kth user, ξ_i[n] denotes the power allocation factor of the ith user, w[n] denotes the horizontal position of the unmanned aerial vehicle, v_m is the maximum moving speed of the unmanned aerial vehicle, N denotes the number of time slots into which a given observation time T is divided, the interval between two adjacent time slots is δ = T/N, and R_k,k[n] denotes the capacity of the kth user.
Preferably, the step S2 further includes:
s200, assuming that the flight trajectory is fixed, reformulating the power allocation strategy optimization problem of the flight trajectory and power allocation optimization problem model under the minimum transmission rate constraint as a convex optimization problem in three variables;
step S201, solving the power allocation strategy optimization problem obtained in step S200 by an iterative approximate convex optimization method to obtain the power allocation factors.
Preferably, in step S200, the objective function r_k[n] is converted into a convex function, so that the power allocation strategy optimization problem of the flight trajectory and power allocation optimization problem model under the minimum transmission rate constraint is reformulated as a convex optimization problem in three variables.
Preferably, in step S200, a first-order Taylor expansion is adopted and a relaxation variable is introduced to convert the objective function into a convex function, so that the power allocation strategy optimization problem of the flight trajectory and power allocation optimization problem model under the minimum transmission rate constraint is reformulated as a convex optimization problem P1 in three variables.
Preferably, in step S201, the solving process of the P1 problem includes:
step 1, obtaining an initial power allocation factor according to the minimum transmission rate requirement, and allocating all remaining power to the strongest user; initializing the iteration index r = 0, and calculating ξ_r[n], I_r[n], I_e,r[n], η_r[n];
step 2, given η_r[n], solving the P1 problem with a standard convex optimization solving tool to obtain the updated ξ_(r+1)[n], I_(r+1)[n], I_e,(r+1)[n], η_(r+1)[n], and updating the iteration index r = r + 1;
step 3, if r reaches the maximum number of iterations or the increment of the objective function of the P1 problem is less than a predetermined threshold, the iteration stops; otherwise, step 2 is repeated.
Preferably, the step S3 further includes:
step S300, performing grid partitioning on the horizontal target space of the flight trajectory of the unmanned aerial vehicle with a partition granularity of v_m·δ × v_m·δ, converting the different grid cells into a state space for reinforcement learning according to their coordinates, and approximating the continuous action space of the unmanned aerial vehicle as a discrete action space consisting of five optional actions;
step S301, defining the sum of the secrecy capacities after the position of the unmanned aerial vehicle is updated as the reward function, and performing iterative updating of the value function;
step S302, after each round of reinforcement learning, the unmanned aerial vehicle obtains a new updated position; the updated power allocation factor is calculated with the P1 solving method of step S2, and the value function is iteratively updated;
and S303, after the unmanned aerial vehicle has explored for a number of rounds, the value function gradually approaches the optimal value function, and the optimal flight trajectory of the unmanned aerial vehicle is finally obtained.
Preferably, in step S301, the value function is iteratively updated according to the following iterative formula:
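The iterative formula referenced here did not survive extraction. Using the symbols defined just below (value function Q_n, reward R_n, learning rate factor θ, discount factor β), the standard Q-learning update it presumably corresponds to is the following — a reconstruction of the textbook form, not the patent's verbatim expression:

```latex
Q_{n+1}(s_n, a_n) = (1 - \theta)\, Q_n(s_n, a_n)
  + \theta \left[ R_n + \beta \max_{a'} Q_n(s_{n+1}, a') \right]
```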
wherein Q_n(s_n, a_n) is the value function, with an all-zero initial value, R_n is the reward function, θ is the learning rate factor, and β is the discount factor.
In order to achieve the above object, the present invention further provides a multidimensional optimization system of an unmanned aerial vehicle communication system based on reinforcement learning, including:
the model building unit is used for building a flight trajectory and power distribution optimization problem model of the unmanned aerial vehicle communication system under the constraint of the minimum transmission rate;
the convex optimization solving unit is used for fixing the flight trajectory, reformulating the power allocation strategy sub-problem of the flight trajectory and power allocation optimization problem model established for the unmanned aerial vehicle communication system under the minimum transmission rate constraint, and solving it by a convex optimization method to obtain the power allocation factors;
and the reinforcement learning optimization unit is used for optimizing the flight trajectory by an iterative reinforcement learning method to obtain the optimal flight trajectory.
Preferably, the reinforcement learning optimization unit is specifically configured to:
firstly, grid partitioning is performed on the horizontal target space of the flight trajectory of the unmanned aerial vehicle (UAV) with a partition granularity of v_m·δ × v_m·δ; the different grid cells are converted into a state space for reinforcement learning according to their coordinates, and the continuous action space of the UAV is approximated as a discrete action space consisting of five optional actions;
the sum of the secrecy capacities after the position of the UAV is updated is defined as the reward function, so as to perform iterative updating of the value function;
after each round of reinforcement learning, the UAV obtains a new updated position; the updated power allocation factor is calculated with the P1 solving method of the convex optimization solving unit, and the value function is iteratively updated;
after the UAV has explored for a number of rounds, the value function gradually approaches the optimal value function, and the optimal UAV flight trajectory is finally obtained.
Compared with the prior art, the reinforcement-learning-based multidimensional optimization method and system for an unmanned aerial vehicle communication system disclosed by the invention jointly optimize the flight trajectory, the power allocation factors and other dimensions by an optimization method combining convex optimization and reinforcement learning. Because the reward function can be obtained through feedback in reinforcement learning, the communication system does not need to be accurately modeled, the application scenarios are wider, and the optimal multidimensional optimization result can be obtained.
Detailed Description
Other advantages and capabilities of the present invention will be readily apparent to those skilled in the art from the present disclosure by describing the embodiments of the present invention with specific embodiments thereof in conjunction with the accompanying drawings. The invention is capable of other and different embodiments and its several details are capable of modification in various other respects, all without departing from the spirit and scope of the present invention.
Fig. 1 is a flowchart illustrating steps of a multidimensional optimization method for an unmanned aerial vehicle communication system based on reinforcement learning according to the present invention. The invention relates to a multidimensional optimization method of an unmanned aerial vehicle communication system based on reinforcement learning, which comprises the following steps:
and step S1, establishing a flight trajectory and power distribution optimization problem model of the unmanned aerial vehicle communication system under the constraint of the minimum transmission rate.
An unmanned aerial vehicle (UAV) communication system model to which the present invention applies is shown in fig. 2, in which there are a UAV communication base station, K target users, and an eavesdropping user. The UAV base station moves freely within a horizontal target area at height H. The location of the ith target user is expressed as L_i = [x_i, y_i]^T, i ∈ [1, K], and the location of the eavesdropping user is expressed as L_e. The flight trajectory of the UAV base station at different time points can be expressed as:
W = {w[n] = [x[n], y[n]]^T, n = 1, 2, ..., N} (formula one)
wherein w[n] represents the horizontal coordinates at the nth observation time point, N represents the number of time slots into which a given observation time T is divided, the interval between two adjacent slots can be expressed as δ = T/N, and the maximum movement rate of the UAV is v_m. At this time, the channel fading power from the ith user to the UAV base station can be expressed as:
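The channel-fading expression itself is missing from the text. Given the definitions that follow (reference gain ρ_o at unit distance, path loss exponent α, UAV altitude H), the conventional distance-based line-of-sight fading model it presumably takes is:

```latex
h_i[n] = \rho_o\, d_i^{-\alpha}, \qquad
d_i[n] = \sqrt{H^2 + \left\lVert w[n] - L_i \right\rVert^2}
```

This is a reconstruction of the standard UAV channel model under those assumptions, not the patent's verbatim formula.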
wherein d_i represents the distance of the ith user from the UAV, ρ_o denotes the reference signal power gain at unit distance, and α ≥ 2 denotes the channel path loss exponent, whose value typically ranges between 2 and 4. Similarly, the channel fading power from the eavesdropping user to the UAV base station is:
wherein d_e denotes the distance of the eavesdropping user from the UAV.
Assuming that the UAV communication system employs a non-orthogonal multiple access (NOMA) communication protocol, its downlink transmission signal can be expressed as:
wherein P represents the total transmit power of the UAV base station, x_i is the data symbol of the ith user, and ξ_i represents the power allocation factor of the ith user.
In view of the power constraint, we have:
wherein Ω represents the set of all users, Ω = [1, K].
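The downlink-signal and power-constraint expressions do not survive in the text. For a NOMA downlink with total power P, data symbols x_i and power allocation factors ξ_i as defined above, the standard forms (a reconstruction under that assumption, not the verbatim formulas) are:

```latex
x[n] = \sum_{i \in \Omega} \sqrt{\xi_i[n]\, P}\; x_i[n],
\qquad \sum_{i \in \Omega} \xi_i[n] \le 1
```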
Then the received signal of the ith user is:
wherein n_i represents the received noise of the target user, with power σ², and g_i[n] is the channel fading power from the ith user to the UAV base station. According to the successive interference cancellation algorithm of the NOMA receiver, the signal-to-noise ratio of the kth data stream symbol at the ith user can be expressed as:
In the above formula, I[n] represents the interference part of the kth data stream, and ξ_i represents the power allocation factor of the ith user. At this time, the capacity of the kth user can be expressed as:
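The SINR and capacity expressions are missing here. Under NOMA with successive interference cancellation, denoting the kth user's channel fading power by h_k[n] and the residual interference of the kth data stream by I[n], the textbook forms they presumably follow are:

```latex
\gamma_k[n] = \frac{\xi_k[n]\, h_k[n]\, P}{I[n]\, h_k[n]\, P + \sigma^2},
\qquad
R_{k,k}[n] = \log_2\!\left(1 + \gamma_k[n]\right)
```

Here I[n] would sum the allocation factors of the streams not yet cancelled at the kth user; this is a hedged reconstruction of the standard form, not the patent's exact expression.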
It is assumed that each user has a minimum transmission rate constraint, i.e.:
wherein the threshold on the right-hand side indicates the minimum transmission rate requirement of the kth user.
Similarly, for an eavesdropping user, the capacity of the kth data stream can be expressed as:
wherein:
According to the definition of secrecy capacity, the secrecy capacity of the kth user is given as follows:
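The secrecy-capacity formula is standard in physical-layer security; with R_k,k[n] the kth user's capacity and R_e,k[n] the eavesdropper's capacity on the kth data stream as defined above, it reads:

```latex
r_k[n] = \left[\, R_{k,k}[n] - R_{e,k}[n] \,\right]^{+}
       = \max\!\left( R_{k,k}[n] - R_{e,k}[n],\ 0 \right)
```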
in summary, the flight trajectory and power allocation optimization problem model under the constraint of the minimum transmission rate can be expressed as:
wherein W represents the optimal flight trajectory, and ξ represents the optimal power allocation factor.
And S2, fixing the flight trajectory, reformulating the power allocation strategy sub-problem of the flight trajectory and power allocation optimization problem model established for the unmanned aerial vehicle communication system under the minimum transmission rate constraint, and solving it by a convex optimization method to obtain the power allocation factors.
Because the flight trajectory and power allocation optimization problem model (formula nine) established in step S1 under the minimum transmission rate constraint is non-convex and difficult to solve directly, the present invention solves this optimization problem with a method combining approximate convex optimization and reinforcement learning; that is, the flight trajectory, the power allocation factors and other dimensions are jointly optimized by an optimization method combining convex optimization and reinforcement learning.
Specifically, step S2 further includes:
And S200, assuming that the flight trajectory is fixed, the power allocation strategy optimization problem of the flight trajectory and power allocation optimization problem model under the minimum transmission rate constraint is reformulated as a convex optimization problem in three variables.
Firstly, assuming that the flight trajectory is fixed, the power distribution strategy optimization problem is solved.
Considering the minimum transmission rate equation six may translate to:
the objective function for the P0 problem can be approximated as:
wherein:
At this time, the objective function is still not a convex function; therefore, the present invention employs a first-order Taylor expansion and introduces a relaxation variable. The objective function is further converted into:
wherein:
the result of the solution of the r-th time is obtained.
At this time, the power allocation strategy optimization problem can be reformulated as:
thus, the P1 problem is a convex optimization problem with respect to three variables.
Step S201, solving the power allocation strategy optimization problem obtained in step S200 by an iterative approximate convex optimization method to obtain the power allocation factors.
In a specific embodiment of the invention, a standard convex optimization solver, such as CVX, is used to perform the numerical solution. Specifically, the flow of the solving method of the P1 problem is as follows:
step 1, initialization: obtain an initial power allocation factor according to the minimum transmission rate requirement, and allocate all remaining power to the strongest user; initialize the iteration index r = 0 and calculate ξ_r[n], I_r[n], I_e,r[n], η_r[n];
step 2, given η_r[n], solve the P1 problem with the CVX tool to obtain the updated ξ_(r+1)[n], I_(r+1)[n], I_e,(r+1)[n], η_(r+1)[n], and update the iteration index r = r + 1;
step 3, if r reaches the maximum number of iterations or the increment of the objective function of the P1 problem is less than a preset threshold, the iteration stops; otherwise, step 2 is repeated.
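The step 1 to step 3 loop above can be sketched in code. The sketch below is illustrative only: the inner CVX solve of the P1 problem is replaced by a simple power-transfer improvement step on a sum-rate surrogate objective (the exact P1 expressions are not recoverable from the text), and the channel gains, minimum rate r_min, and stopping threshold are assumed placeholder values.

```python
import math

def sum_rate(xi, gains):
    """Surrogate objective: sum of log2(1 + xi_i * g_i) over users."""
    return sum(math.log2(1.0 + x * g) for x, g in zip(xi, gains))

def solve_power_allocation(gains, r_min=0.1, max_iter=10000,
                           step=1e-3, eps=1e-9):
    K = len(gains)
    # Step 1: initial allocation from the minimum-rate requirement:
    # log2(1 + xi * g) >= r_min  =>  xi >= (2**r_min - 1) / g,
    # then all remaining power goes to the strongest user.
    floor = [(2.0 ** r_min - 1.0) / g for g in gains]
    if sum(floor) > 1.0:
        raise ValueError("minimum-rate constraints are infeasible")
    xi = floor[:]
    xi[max(range(K), key=lambda i: gains[i])] += 1.0 - sum(floor)
    prev = sum_rate(xi, gains)
    # Steps 2-3: iterate an inner improvement step (stand-in for the
    # CVX solve) until the objective increment drops below a threshold
    # or the maximum number of iterations is reached.
    for _ in range(max_iter):
        # marginal rate: d/dxi log2(1 + xi*g) = g / ((1 + xi*g) ln 2)
        deriv = [g / ((1.0 + x * g) * math.log(2.0))
                 for x, g in zip(xi, gains)]
        donors = [i for i in range(K) if xi[i] > floor[i] + step]
        if not donors:
            break
        src = min(donors, key=lambda i: deriv[i])
        dst = max(range(K), key=lambda i: deriv[i])
        if src == dst:
            break
        xi[src] -= step
        xi[dst] += step
        cur = sum_rate(xi, gains)
        if cur - prev < eps:   # increment below threshold: stop
            xi[src] += step    # undo the non-improving move
            xi[dst] -= step
            break
        prev = cur
    return xi
```

With gains such as [1.0, 2.0, 4.0], the loop redistributes the residual power toward a water-filling-like allocation while keeping every user above its rate floor, mirroring the initialize-then-iterate structure of steps 1 to 3.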
And step S3, optimizing the flight trajectory by an iterative reinforcement learning method to obtain the optimal flight trajectory.
After the power distribution factor is obtained, the flight trajectory needs to be optimized continuously. The invention adopts an optimization method based on reinforcement learning to solve. Specifically, step S3 further includes:
Step S300, grid partitioning is performed on the horizontal target space of the UAV flight trajectory with a partition granularity of v_m·δ × v_m·δ, and the different grid cells are converted, according to their coordinates, into a state space s_n for reinforcement learning.
Step S301, the continuous action space of the UAV is approximated as a discrete action space a_n consisting of five optional actions.
Step S302, the sum of the secrecy capacities after the position update of the UAV is defined as the reward function, and the value function is iteratively updated with the following iterative formula:
wherein Q_n(s_n, a_n) is the value function, with an all-zero initial value, R_n is the reward function, θ is the learning rate factor, and β is the discount factor; s_n represents the state at time point n, i.e., the horizontal coordinate, and a_n represents the action taken by the UAV at time point n.
Step S303, after each round of reinforcement learning, the UAV obtains a new updated position. The reinforcement learning update adopts a probabilistic greedy algorithm: the optimal action under the current value function is selected with a certain probability, and the remaining probability is evenly distributed over all other non-optimal actions. The updated power allocation factor is calculated with the P1 solving method of step S2, the value function is updated with the above iterative formula, and whether the value function approaches the optimal value function is judged; if not, a new updated position is obtained, the updated power allocation factor is calculated, and the iteration continues until, as described in step S304, the value function approaches the optimal value function.
And S304, after the UAV has explored for a number of rounds, the value function gradually approaches the optimal value function, and the optimal UAV flight trajectory is finally obtained.
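Steps S300 to S304 can be sketched as a small tabular Q-learning loop. Everything numeric below is a placeholder assumption: the grid size, the hyperparameters, and above all the reward, which stands in for the secrecy-capacity sum that the actual method recomputes via the P1 solve at every position update.

```python
import random

GRID = 10                                  # cells per side (v_m * delta granularity)
ACTIONS = [(0, 1), (0, -1), (1, 0), (-1, 0), (0, 0)]  # five optional actions
THETA, BETA, EPS = 0.5, 0.9, 0.1           # learning rate, discount, exploration

def reward(state):
    # Placeholder for the secrecy-capacity sum after the position update:
    # here simply closeness to an assumed best hovering cell (7, 3).
    return -abs(state[0] - 7) - abs(state[1] - 3)

def step(state, action):
    # Clip to the grid so the UAV never leaves the target area.
    x = min(GRID - 1, max(0, state[0] + action[0]))
    y = min(GRID - 1, max(0, state[1] + action[1]))
    return (x, y)

def greedy(Q, s):
    # Best action under the current value function (unseen entries are 0).
    return max(range(len(ACTIONS)), key=lambda a: Q.get((s, a), 0.0))

def train(episodes=500, horizon=40, seed=0):
    rng = random.Random(seed)
    Q = {}                                 # value function, all-zero initial value
    for _ in range(episodes):
        s = (0, 0)
        for _ in range(horizon):
            # Probabilistic greedy: explore with probability EPS,
            # otherwise take the best action under the current values.
            a = rng.randrange(len(ACTIONS)) if rng.random() < EPS else greedy(Q, s)
            s2 = step(s, ACTIONS[a])
            best_next = max(Q.get((s2, i), 0.0) for i in range(len(ACTIONS)))
            # Value-function update with learning rate THETA, discount BETA.
            Q[(s, a)] = ((1 - THETA) * Q.get((s, a), 0.0)
                         + THETA * (reward(s2) + BETA * best_next))
            s = s2
    return Q
```

After training, following the greedy policy from the starting cell traces the learned flight trajectory toward the high-reward region, corresponding to the value function approaching its optimum after a number of exploration rounds.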
Fig. 3 is a system architecture diagram of a multidimensional optimization system of an unmanned aerial vehicle communication system based on reinforcement learning according to the present invention. The invention relates to a multidimensional optimization system of an unmanned aerial vehicle communication system based on reinforcement learning, which comprises the following components:
the model building unit 301 is configured to build a flight trajectory and power distribution optimization problem model of the unmanned aerial vehicle communication system under the constraint of the minimum transmission rate.
The unmanned aerial vehicle (UAV) communication system model to which the present invention applies is shown in fig. 2, in which there are a UAV communication base station, K target users, and an eavesdropping user. The UAV base station moves freely within a horizontal target area at height H. The location of the ith target user is expressed as L_i = [x_i, y_i]^T, i ∈ [1, K], and the location of the eavesdropping user is expressed as L_e. The flight trajectory of the UAV base station at different time points can be expressed as:
W = {w[n] = [x[n], y[n]]^T, n = 1, 2, ..., N}
wherein N represents the number of time slots into which a given observation time T is divided, the interval between two adjacent slots can be expressed as δ = T/N, and the maximum movement rate of the UAV is v_m. At this time, the channel fading power from the ith user to the UAV base station can be expressed as:
wherein ρ_o is the reference signal power gain at unit distance, and α ≥ 2 represents the channel path loss exponent. Similarly, the channel fading power from the eavesdropping user to the UAV base station is:
assuming that the UAV communication system employs a non-orthogonal multiple access (NOMA) communication protocol, its downlink transmission signal can be expressed as:
wherein P represents the total transmit power of the UAV base station, x_i is the data symbol of the ith user, and ξ_i represents the power allocation factor of the ith user.
In view of the power constraint, we have:
Then the received signal of the ith user is:
wherein n_i represents the received noise of the target user. According to the successive interference cancellation algorithm of the NOMA receiver, the signal-to-noise ratio of the kth data stream symbol at the ith user can be expressed as:
In the above formula, I[n] represents the interference part of the kth data stream. At this time, the capacity of the kth user can be expressed as:
It is assumed that each user has a minimum transmission rate constraint, i.e.:
similarly, for an eavesdropping user, the capacity of the kth data stream can be expressed as:
wherein:
According to the definition of secrecy capacity, the secrecy capacity of the kth user is given as follows:
in summary, the flight trajectory and power allocation optimization problem model under the constraint of the minimum transmission rate can be expressed as:
And the convex optimization solving unit 302 is configured to fix the flight trajectory, reformulate the power allocation strategy sub-problem of the flight trajectory and power allocation optimization problem model established for the unmanned aerial vehicle communication system under the minimum transmission rate constraint, and solve it with a convex optimization method to obtain the power allocation factors.
Because the flight trajectory and power allocation optimization problem model (formula nine) established by the model establishing unit 301 under the minimum transmission rate constraint is non-convex and difficult to solve directly, the present invention solves this optimization problem with a method combining approximate convex optimization and reinforcement learning; that is, the flight trajectory, the power allocation factors and other dimensions are jointly optimized by an optimization method combining convex optimization and reinforcement learning.
Specifically, the convex optimization solving unit 302 further includes:
And the model conversion module is used for assuming that the flight trajectory is fixed and reformulating the power allocation strategy optimization problem of the flight trajectory and power allocation optimization problem model under the minimum transmission rate constraint as a convex optimization problem in three variables.
Firstly, assuming that the flight trajectory is fixed, the power distribution strategy optimization problem is solved.
Considering the minimum transmission rate constraint, this can be converted to:
the objective function for the P0 problem can be approximated as:
wherein:
At this time, the objective function is still not a convex function; therefore, the invention adopts a first-order Taylor expansion, introduces a relaxation variable, and further converts the objective function into:
wherein:
denotes the solution result of the rth iteration.
At this time, the power allocation policy optimization problem can be collated as:
thus, the P1 problem is a convex optimization problem with respect to three variables.
And the convex optimization solving module is used for solving the power distribution strategy optimization problem converted by the model conversion module by adopting an iterative approximate convex optimization method to obtain a power distribution factor.
In a specific embodiment of the invention, a standard convex optimization solver, such as CVX, is used to perform the numerical solution. Specifically, the flow of the solving method of the P1 problem is as follows:
step 1, initialization: obtain an initial power allocation factor according to the minimum transmission rate requirement, and allocate all remaining power to the strongest user; initialize the iteration index r = 0 and calculate ξ_r[n], I_r[n], I_e,r[n], η_r[n];
step 2, given η_r[n], solve the P1 problem with the CVX tool to obtain the updated ξ_(r+1)[n], I_(r+1)[n], I_e,(r+1)[n], η_(r+1)[n], and update the iteration index r = r + 1;
step 3, if r reaches the maximum number of iterations or the increment of the objective function of the P1 problem is less than a preset threshold, the iteration stops; otherwise, step 2 is repeated.
And the reinforcement learning optimization unit 303 is configured to optimize the flight trajectory by an iterative reinforcement learning method to obtain the optimal flight trajectory.
After the power distribution factor is obtained, the flight trajectory needs to be optimized continuously. The reinforcement learning optimization unit 303 of the present invention performs solution by using an optimization method based on reinforcement learning. The reinforcement learning optimization unit 303 is specifically configured to:
Firstly, grid partitioning is performed on the horizontal target space of the flight trajectory of the unmanned aerial vehicle (UAV) with a partition granularity of v_m·δ × v_m·δ; the different grid cells are converted into a state space for reinforcement learning according to their coordinates, and the continuous action space of the UAV is approximated as a discrete action space consisting of five optional actions: front, back, left, right, and hovering in place.
The sum of the secrecy capacities after the UAV position update is defined as the reward function, and the value function is iteratively updated with the following iterative formula:
wherein Q_n(s_n, a_n) is the value function, with an all-zero initial value, R_n is the reward function, θ is the learning rate factor, and β is the discount factor.
After each round of reinforcement learning, the UAV obtains a new updated position. The reinforcement learning update adopts a probabilistic greedy algorithm: the optimal action under the current value function is selected with a certain probability, and the remaining probability is evenly distributed over all other non-optimal actions; the updated power allocation factor is calculated with the P1 solving method of the convex optimization solving unit 302. The value function is updated with the above iterative formula, and whether it approaches the optimal value function is judged; if not, a new updated position is obtained, the updated power allocation factor is calculated, and the iteration continues until the value function approaches the optimal value function.
After the UAV has explored for a number of rounds, the value function gradually approaches the optimal value function, and the optimal UAV flight trajectory is finally obtained.
Examples
In this embodiment, it is assumed that 9 target users are distributed along a 45-degree diagonal line and that the eavesdropping user is located at the coordinates (100 ); the resulting flight trajectory is shown in fig. 4 and the corresponding secrecy capacity in fig. 5. As can be seen from fig. 4 and 5, the reinforcement-learning-based multidimensional optimization method reaches a substantially steady state after about 2000 rounds of exploration and maintains the optimal capacity sum with high probability.
In summary, the present invention provides a reinforcement-learning-based multidimensional optimization method and system for an unmanned aerial vehicle communication system, which jointly optimizes multiple dimensions such as the flight trajectory and the power allocation factors by an optimization method combining convex optimization and reinforcement learning.
The foregoing embodiments are merely illustrative of the principles and utilities of the present invention and are not intended to limit the invention. Modifications and variations can be made to the above-described embodiments by those skilled in the art without departing from the spirit and scope of the present invention. Therefore, the scope of the invention should be determined from the following claims.