CN113364495A

CN113364495A - Multi-unmanned aerial vehicle track and intelligent reflecting surface phase shift joint optimization method and system

Info

Publication number: CN113364495A
Application number: CN202110573024.6A
Authority: CN
Inventors: 张超; 亓乾月
Original assignee: Xian Jiaotong University
Current assignee: Xian Jiaotong University
Priority date: 2021-05-25
Filing date: 2021-05-25
Publication date: 2021-09-07
Anticipated expiration: 2041-05-25
Also published as: CN113364495B

Abstract

The invention discloses a method and a system for jointly optimizing the trajectories of multiple unmanned aerial vehicles and the phase shifts of intelligent reflecting surfaces. A model of a wireless communication system assisted by multiple unmanned aerial vehicles and intelligent reflecting surfaces is established, in which a signal sent by a user is reflected to a base station by the intelligent reflecting surface mounted on an unmanned aerial vehicle; the channel model and the energy consumption models of the unmanned aerial vehicles and the intelligent reflecting surfaces are determined, and the energy efficiency of the system is calculated. Ground users are clustered with the K-means clustering algorithm, and a prioritized experience replay MATD3 method then determines, for each cluster, the position of the unmanned aerial vehicle, the user that communicates with the base station with the assistance of the unmanned aerial vehicle and the intelligent reflecting surface, the activated reflecting elements of the intelligent reflecting surface, and the phase shifts of the activated reflecting elements, thereby completing the joint optimization of the multi-unmanned aerial vehicle trajectories and the intelligent reflecting surface phase shifts. The invention solves the problems of high communication delay and high power consumption of existing offline optimization methods.

Description

Multi-unmanned aerial vehicle track and intelligent reflecting surface phase shift joint optimization method and system

Technical Field

The invention belongs to the technical field of wireless communication, and particularly relates to a method and a system for joint optimization of multi-unmanned aerial vehicle track and intelligent reflecting surface phase shift.

Background

With the development of Internet of Things technology, more and more devices need to access communication networks, and these devices are sometimes distributed over a very large area. If a single unmanned aerial vehicle carrying a single intelligent reflecting surface serves a large number of communication devices, the unmanned aerial vehicle bears a heavy communication load. In addition, long-distance flight consumes considerable time and energy, which causes serious communication delay and makes the power consumption of the unmanned aerial vehicle a challenge.

To provide user equipment with low-delay, high-reliability service, multiple unmanned aerial vehicles and multiple intelligent reflecting surfaces can be used to assist communication. A K-means clustering algorithm divides the ground users into several areas, and each unmanned aerial vehicle carrying an intelligent reflecting surface serves the users in one area. While good communication quality is ensured, the trajectories of the unmanned aerial vehicles and the phase shifts of the intelligent reflecting surfaces are jointly optimized with a multi-agent reinforcement learning algorithm so as to maximize the energy efficiency of the wireless communication system.

Disclosure of Invention

The technical problem to be solved by the invention is to provide, in view of the defects of the prior art, a method and a system for joint optimization of multi-unmanned aerial vehicle trajectories and intelligent reflecting surface phase shifts, which solve the problems of high communication delay and high power consumption of existing offline methods for optimizing unmanned aerial vehicle trajectories and intelligent reflecting surface phase shifts.

The invention adopts the following technical scheme:

A multi-unmanned aerial vehicle trajectory and intelligent reflecting surface phase shift joint optimization method comprises the following steps:

S1, establishing a wireless communication system model assisted by multiple unmanned aerial vehicles and intelligent reflecting surfaces, in which a signal sent by a user is reflected to a base station by the intelligent reflecting surface mounted on an unmanned aerial vehicle; determining the channel model of the wireless communication system model and the energy consumption models of the unmanned aerial vehicles and the intelligent reflecting surfaces; and calculating the energy efficiency of the wireless communication system model;

S2, based on the channel model and the energy consumption models determined in step S1, clustering the ground users with the K-means clustering algorithm; then, with energy efficiency as the optimization target, using the prioritized experience replay MATD3 method to determine, in each cluster, the position of the unmanned aerial vehicle, the user that communicates with the base station with the assistance of the unmanned aerial vehicle and the intelligent reflecting surface, the activated reflecting elements of the intelligent reflecting surface, and the phase shifts of the activated reflecting elements, thereby completing the joint optimization of the multi-unmanned aerial vehicle trajectories and the intelligent reflecting surface phase shifts.

Specifically, in step S1, the wireless communication system model assisted by multiple unmanned aerial vehicles and intelligent reflecting surfaces is as follows: U users are randomly distributed and divided into K areas, with u_k users in area k, so that u₁+…+u_k+…+u_K = U. The numbers of intelligent reflecting surfaces and of unmanned aerial vehicles are both K, and each unmanned aerial vehicle carrying an intelligent reflecting surface serves the users in one area; the intelligent reflecting surface carried on the unmanned aerial vehicle adjusts the phase shifts of its M reflecting elements through an integrated controller; the base station receives the signals reflected by all the intelligent reflecting surfaces at the same time; the base station has N antennas, each intelligent reflecting surface has M reflecting elements, and each user has a single antenna; the coordinates of the base station are (x_BS, y_BS, z_BS). The coordinates of the intelligent reflecting surface p are

The coordinates of the user q are

Only one user in each area sends a signal at a time. The signal sent by each user is reflected to the base station both by the intelligent reflecting surface serving that area and by the intelligent reflecting surfaces serving the other areas, so the number of users participating in communication and the number of intelligent reflecting surfaces are both K. Each reflecting element of the intelligent reflecting surface independently adjusts the phase shift of the incident signal while keeping its amplitude unchanged. The phase shift matrix of the intelligent reflecting surface p is a diagonal matrix Θ_p = diag(ν_p), whose diagonal elements are

where θ_pm represents the phase shift of the m-th reflecting element of the intelligent reflecting surface p. The activated-element matrix of the intelligent reflecting surface is a diagonal matrix Δ_p = diag(υ_p) with diagonal elements υ_p = (δ_p1, …, δ_pm, …, δ_pM), where δ_pm indicates whether the m-th reflecting element of the intelligent reflecting surface p is activated.
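The two diagonal matrices can be illustrated with a short NumPy sketch. It assumes that each diagonal entry of Θ_p has the unit-modulus form e^{jθ_pm}, which matches the statement that the amplitude is kept unchanged but is not confirmed by the text, since the original expression is missing. The function name `irs_matrices` is hypothetical.

```python
import numpy as np

def irs_matrices(theta, delta):
    """Build the phase-shift matrix Theta_p and the activation matrix Delta_p.

    theta: phase shifts theta_pm of the M reflecting elements (radians).
    delta: 0/1 flags delta_pm telling whether each element is activated.
    """
    theta = np.asarray(theta, dtype=float)
    delta = np.asarray(delta, dtype=int)
    if theta.shape != delta.shape:
        raise ValueError("theta and delta must have the same length M")
    if not np.all((delta == 0) | (delta == 1)):
        raise ValueError("delta must contain only 0 or 1")
    # Assumption: unit-modulus entries e^{j*theta_pm} (amplitude unchanged).
    Theta = np.diag(np.exp(1j * theta))
    Delta = np.diag(delta.astype(float))
    return Theta, Delta

Theta_p, Delta_p = irs_matrices([0.0, np.pi / 2, np.pi], [1, 0, 1])
# The product applies the phase shifts only on activated elements.
effective = Theta_p @ Delta_p
```

Here the second element is deactivated, so the corresponding diagonal entry of `effective` is zero and that element contributes nothing to the reflected signal.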

Specifically, in step S1, the reflection of the signal sent by the user to the base station by the intelligent reflecting surface on the unmanned aerial vehicle comprises a decision stage, a flight stage, and an information transmission stage. Decision stage: the unmanned aerial vehicle selects the user to communicate with and the position for information transmission, and the intelligent reflecting surface selects the reflecting elements to activate and their phase shifts. Flight stage: the unmanned aerial vehicle flies in a straight line at speed v to the information transmission position selected in the decision stage. Information transmission stage: the unmanned aerial vehicle hovers after reaching the specified position, the user selected in the decision stage sends a signal to the intelligent reflecting surface, and the activated reflecting elements reflect the signal to the base station with the corresponding phase shifts.

Specifically, in step S1, the channels between the user and the intelligent reflecting surface and between the intelligent reflecting surface and the base station are modeled as Rician channels, and the channel G_pq from user q to intelligent reflecting surface p is:

where ρ represents the path loss at the reference distance d₀ = 1 m, k₁ is the path loss exponent, β is the Rician fading factor, d₁ is the Euclidean distance between user q and intelligent reflecting surface p,

is a non-line-of-sight propagation component,

is the array response vector,

is the cosine of the angle of arrival of the signal from user q to intelligent reflecting surface p, λ represents the carrier wavelength, and d represents the antenna spacing.
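Because the channel expression itself is missing from the text, the following sketch uses the textbook Rician form built from the symbols defined above: a distance-dependent path gain multiplying a line-of-sight array response weighted by β/(1+β) and a non-line-of-sight term weighted by 1/(1+β). The uniform-linear-array response, the default parameter values, and the function names are assumptions for illustration only.

```python
import numpy as np

def array_response(num_elements, cos_angle, spacing, wavelength):
    """Uniform-linear-array response vector (assumed form)."""
    m = np.arange(num_elements)
    return np.exp(-1j * 2 * np.pi * spacing / wavelength * m * cos_angle)

def rician_user_to_irs(M, d1, cos_aoa, rho=1e-3, k1=2.2, beta=10.0,
                       spacing=0.05, wavelength=0.1, rng=None):
    """Sample a user-to-surface channel G_pq of length M (assumed Rician form)."""
    rng = np.random.default_rng() if rng is None else rng
    # Line-of-sight part: deterministic array response.
    los = array_response(M, cos_aoa, spacing, wavelength)
    # Non-line-of-sight part: circularly symmetric complex Gaussian, unit variance.
    nlos = (rng.standard_normal(M) + 1j * rng.standard_normal(M)) / np.sqrt(2)
    path_gain = np.sqrt(rho * d1 ** (-k1))
    return path_gain * (np.sqrt(beta / (1 + beta)) * los
                        + np.sqrt(1 / (1 + beta)) * nlos)

G_pq = rician_user_to_irs(M=8, d1=50.0, cos_aoa=0.3,
                          rng=np.random.default_rng(0))
```

A large β makes the channel almost purely line-of-sight, while β = 0 reduces it to Rayleigh fading.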

The channel F_p from intelligent reflecting surface p to the base station is:

where d₂ represents the Euclidean distance between intelligent reflecting surface p and the base station,

is a non-line-of-sight propagation component,

and

is an array response vector;

The received signal y at the base station is:

where S is the transmit signal matrix, H is the channel matrix, h_k is the k-th column of H, s_k is the k-th row of S, and n is the additive white Gaussian noise at the base station, a circularly symmetric complex Gaussian variable with variance σ²;
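Given these definitions, the omitted expression is presumably the standard linear multi-user model. The following is a reconstruction from the symbol definitions above, not the original equation:

```latex
y = HS + n = \sum_{k=1}^{K} h_k s_k + n
```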

Treating the interference from other users as noise, the signal-to-interference-plus-noise ratio SINR_k of the k-th user is:

The information transmission rate R_k of the k-th user is:

where K is the number of users communicating with the base station at the same time, w_k is the k-th row of the zero-forcing detection filter matrix,

is the conjugate transpose of the channel matrix between intelligent reflecting surface p and the base station, Θ_p is the phase shift matrix of intelligent reflecting surface p, Δ_p is the activated-element matrix of intelligent reflecting surface p, G_pq is the channel between user q and intelligent reflecting surface p, G_pk is the channel between user k and intelligent reflecting surface p, and σ² is the noise variance.
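The SINR and rate expressions are missing from the text, so the sketch below only illustrates how the listed quantities typically combine under zero-forcing detection. It assumes that the effective channel of user k is the sum over all surfaces p of F_p^H Θ_p Δ_p G_pk, that the zero-forcing filter is the pseudo-inverse of the effective channel matrix, and that the rate is log2(1 + SINR_k) per unit bandwidth. The transmit power argument and all function names are hypothetical.

```python
import numpy as np

def effective_channel(F_list, Theta_list, Delta_list, G):
    """Stack the cascaded channels into an N x K matrix H.

    F_list[p]: M x N channel between surface p and the base station.
    G[p][k]:   length-M channel between user k and surface p.
    """
    K = len(F_list)
    N = F_list[0].shape[1]
    H = np.zeros((N, K), dtype=complex)
    for k in range(K):
        for p in range(K):
            H[:, k] += F_list[p].conj().T @ Theta_list[p] @ Delta_list[p] @ G[p][k]
    return H

def zf_rates(H, tx_power, noise_var):
    """SINR and rate per user after zero-forcing detection (assumed form)."""
    W = np.linalg.pinv(H)                     # row k is the filter w_k
    gains = tx_power * np.abs(W @ H) ** 2     # entry (k, j): user j seen by filter k
    signal = np.diag(gains)
    interference = gains.sum(axis=1) - signal
    noise = noise_var * np.sum(np.abs(W) ** 2, axis=1)
    sinr = signal / (interference + noise)
    return sinr, np.log2(1 + sinr)

rng = np.random.default_rng(0)
K, M, N = 2, 8, 4

def crandn(*shape):
    """Circularly symmetric complex Gaussian samples with unit variance."""
    return (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2)

F_list = [crandn(M, N) for _ in range(K)]
G = [[crandn(M) for _ in range(K)] for _ in range(K)]
Theta_list = [np.diag(np.exp(1j * rng.uniform(0, 2 * np.pi, M))) for _ in range(K)]
Delta_list = [np.eye(M) for _ in range(K)]
H = effective_channel(F_list, Theta_list, Delta_list, G)
sinr, rate = zf_rates(H, tx_power=0.1, noise_var=1e-9)
```

With N ≥ K and a full-rank H, the zero-forcing filter removes inter-user interference, so the interference term is numerically zero and the SINR is limited by the noise amplified by w_k.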

Specifically, in step S1, the energy efficiency EE_p is the amount of data transmitted divided by the total energy consumed by unmanned aerial vehicle p and intelligent reflecting surface p, specifically:

where

is the energy consumed by the unmanned aerial vehicle in flying to the designated position, G_p is the amount of data transmitted to the base station by the user with the assistance of unmanned aerial vehicle p and intelligent reflecting surface p,

is the energy consumed by intelligent reflecting surface p,

is the propulsion power of unmanned aerial vehicle p, and T is the time required for the unmanned aerial vehicle to fly to the designated position.
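As a numerical illustration of the ratio described above, the sketch below takes the flight energy as propulsion power times flight time T, with T equal to the straight-line distance divided by the speed v. The model of the surface energy (a fixed power per activated element over the transmission time) and the model of the data amount (rate times transmission time) are assumptions, as are all parameter names and values.

```python
import math

def flight_time(start, target, speed):
    """Time T for a straight-line flight at constant speed v."""
    return math.dist(start, target) / speed

def energy_efficiency(rate_bps, tx_time, prop_power, fly_time,
                      n_active, element_power):
    """Transmitted data divided by total consumed energy (bits per joule)."""
    data = rate_bps * tx_time                      # assumed data amount G_p
    e_fly = prop_power * fly_time                  # flight energy: power times T
    e_irs = n_active * element_power * tx_time     # assumed surface energy model
    return data / (e_fly + e_irs)

T = flight_time((0.0, 0.0, 100.0), (30.0, 40.0, 100.0), speed=10.0)
ee = energy_efficiency(rate_bps=1e6, tx_time=2.0, prop_power=100.0,
                       fly_time=T, n_active=20, element_power=0.01)
```

In this example the flight energy (500 J) dominates the surface energy (0.4 J), which is consistent with the stated motivation of shortening flight distances by clustering.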

Specifically, in step S2, clustering the users by using a K-means clustering algorithm specifically includes:

and if the cluster centers of all clusters are identical to those obtained in the previous iteration, the clustering criterion function has converged and every user has been assigned to its cluster.
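A minimal K-means sketch using the convergence test stated above (stop when the cluster centers equal those of the previous iteration) might look as follows. The initialization from randomly chosen users, the optional `init_centers` argument, and the handling of empty clusters are assumptions, since the preceding steps of the procedure are not reproduced in the text.

```python
import numpy as np

def kmeans(points, K, init_centers=None, max_iter=100, rng=None):
    """Cluster user positions into K clusters; returns (labels, centers)."""
    rng = np.random.default_rng() if rng is None else rng
    points = np.asarray(points, dtype=float)
    if init_centers is None:
        # Assumption: start from K distinct, randomly chosen users.
        centers = points[rng.choice(len(points), size=K, replace=False)]
    else:
        centers = np.asarray(init_centers, dtype=float)
    for _ in range(max_iter):
        # Assign every user to the nearest cluster center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each center as the mean of its members (keep empty ones).
        new_centers = np.array([
            points[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(K)
        ])
        # Converged when the centers equal those of the previous iteration.
        if np.array_equal(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

users = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
labels, centers = kmeans(users, K=2, init_centers=[(0, 0), (10, 10)])
```

Each resulting cluster would then be served by one unmanned aerial vehicle carrying an intelligent reflecting surface.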

Specifically, in step S2, using the prioritized experience replay MATD3 method to determine, in each cluster, the position of the unmanned aerial vehicle, the position of the user communicating with the base station, the activated reflecting elements of the intelligent reflecting surface, and the phase shifts of the activated elements, thereby completing the joint optimization of the multi-unmanned aerial vehicle trajectories and the intelligent reflecting surface phase shifts, is specifically:

The optimization of the unmanned aerial vehicle trajectories and intelligent reflecting surface phase shifts in the wireless communication system assisted by multiple unmanned aerial vehicles and intelligent reflecting surfaces is modeled as a Markov game, in which each unmanned aerial vehicle carrying an intelligent reflecting surface acts as an agent. The k-th agent observes the current environment state s_k, selects an action a_k according to its policy π_k, and receives a reward r_k after the action acts on the environment; the environment then transitions to a new state s'_k with transition probability P(s'_k | s_k, a₁, …, a_K);

At each moment, the k-th agent observes, as the state s_k, the position of unmanned aerial vehicle k at the previous moment and the position of the user communicating with the base station in the k-th cluster. The training policy network, with parameter θ_k, takes the state s_k as input and outputs as the behavior a_k the position of the k-th unmanned aerial vehicle at the current moment, the activated user vector of the k-th cluster for communicating with the base station, and the activated element vector and phase shift vector of the k-th intelligent reflecting surface. The first and second training value networks, with parameters ω_k1 and ω_k2 respectively, take as input the joint state s = (s₁, s₂, …, s_K) observed by the agents and the joint action a = (a₁, a₂, …, a_K) taken, and output the joint state-behavior value functions Q_k1(s, a₁, a₂, …, a_K, ω_k1) and Q_k2(s, a₁, a₂, …, a_K, ω_k2) respectively. The target policy network takes the next state s'_k as input and outputs the next action a'_k, and its parameter θ'_k is soft-updated from the parameter θ_k of the training policy network. The first and second target value networks take the next state-behavior pair (s', a') as input and output Q'_k1(s', a'₁, a'₂, …, a'_K, ω'_k1) and Q'_k2(s', a'₁, a'₂, …, a'_K, ω'_k2) respectively, and their parameters ω'_k1 and ω'_k2 are soft-updated from the parameters ω_k1 and ω_k2 of the first and second training value networks;

The tuple (s, a₁, a₂, …, a_K, r₁, r₂, …, r_K, s') is stored in an experience memory as one experience of the agents. When the experience memory reaches its maximum storage capacity, a mini-batch of experiences is sampled from it by prioritized experience replay for training, and the parameters of the policy networks and value networks are updated.
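The experience memory can be sketched as a small buffer with proportional prioritization. The text does not specify which variant of prioritized experience replay is used, so the exponents `alpha` and `beta`, the use of the absolute TD error as priority, and the class name are assumptions made for illustration.

```python
import numpy as np

class PrioritizedReplay:
    """Fixed-capacity experience memory with proportional prioritization."""

    def __init__(self, capacity, alpha=0.6, beta=0.4, eps=1e-6, rng=None):
        self.capacity, self.alpha, self.beta, self.eps = capacity, alpha, beta, eps
        self.rng = np.random.default_rng() if rng is None else rng
        self.data = []
        self.priorities = np.zeros(capacity)
        self.pos = 0

    def is_full(self):
        return len(self.data) == self.capacity

    def add(self, experience):
        # New experiences get the current maximum priority so they are seen once.
        max_p = self.priorities[:len(self.data)].max() if self.data else 1.0
        if len(self.data) < self.capacity:
            self.data.append(experience)
        else:
            self.data[self.pos] = experience   # overwrite the oldest entry
        self.priorities[self.pos] = max_p
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size):
        n = len(self.data)
        probs = self.priorities[:n] ** self.alpha
        probs = probs / probs.sum()
        idx = self.rng.choice(n, size=batch_size, p=probs)
        # Importance-sampling weights w_j, normalized by their maximum.
        weights = (n * probs[idx]) ** (-self.beta)
        weights = weights / weights.max()
        return idx, [self.data[i] for i in idx], weights

    def update(self, idx, td_errors):
        self.priorities[idx] = np.abs(td_errors) + self.eps

memory = PrioritizedReplay(capacity=4, rng=np.random.default_rng(0))
for step in range(4):
    memory.add(("s", "a", float(step), "s_next"))  # placeholder experience tuples
if memory.is_full():
    idx, batch, weights = memory.sample(batch_size=3)
    memory.update(idx, td_errors=np.array([0.5, 0.1, 2.0]))
```

Sampling starts only once the memory is full, mirroring the condition stated above; experiences with larger TD error are then drawn more often.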

Further, the state s_k observed by each unmanned aerial vehicle comprises two parts: the position of unmanned aerial vehicle k (k = 1, 2, …, K) at the previous moment,

and the position of the user in the k-th cluster that communicates with the base station with the assistance of the k-th unmanned aerial vehicle and intelligent reflecting surface,

so the state s_k is six-dimensional. The behavior a_k comprises the following four parts:

i: position of kth unmanned aerial vehicle at current moment

ii: activated user vector communicating with base station in kth cluster at current time

Each element of this vector indicates whether the corresponding user is activated: 0 means not activated and 1 means activated. The vector

should satisfy

which means that only one user in a cluster is activated at any time;

iii: activated element vector of k-th intelligent reflecting surface at current moment

Each element of this vector indicates whether the corresponding reflecting element is activated: 0 means not activated and 1 means activated. The vector

should satisfy

which means that the number of activated elements of each intelligent reflecting surface is between 1 and M.

iv: phase shift vector of intelligent reflecting surface at current moment

Each element of which represents the phase shift of the corresponding reflecting element.

The reward is defined as the energy efficiency EE_k: r_k(s_k, a_k) = EE_k.
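The four parts of the behavior a_k and their constraints can be illustrated by decoding a raw policy-network output. The text does not say how the network output is mapped to these parts, so the sketch assumes a raw vector in [-1, 1], an argmax to pick the single activated user, a sign threshold for the element flags (forcing at least one activated element), and phase shifts scaled to [0, 2π]. All names and ranges are hypothetical.

```python
import numpy as np

def decode_action(raw, num_users, M, pos_low, pos_high):
    """Split a raw output in [-1, 1] into (position, user, active, phase)."""
    raw = np.asarray(raw, dtype=float)
    if raw.shape != (3 + num_users + 2 * M,):
        raise ValueError("unexpected action length")
    pos_low = np.asarray(pos_low, dtype=float)
    pos_high = np.asarray(pos_high, dtype=float)
    # i: position of the unmanned aerial vehicle, scaled into its flight region.
    position = pos_low + (raw[:3] + 1) / 2 * (pos_high - pos_low)
    # ii: activated user vector with exactly one entry equal to 1.
    user = np.zeros(num_users, dtype=int)
    user[np.argmax(raw[3:3 + num_users])] = 1
    # iii: activated element vector with between 1 and M entries equal to 1.
    elem_scores = raw[3 + num_users:3 + num_users + M]
    active = (elem_scores > 0).astype(int)
    if active.sum() == 0:
        active[np.argmax(elem_scores)] = 1
    # iv: phase shift vector, assumed to lie in [0, 2*pi].
    phase = (raw[-M:] + 1) * np.pi
    return position, user, active, phase

raw = [0, 0, 0, 0.1, 0.9, -0.5, -0.2, -0.3, 0.0, 1.0]
position, user, active, phase = decode_action(
    raw, num_users=3, M=2, pos_low=(0, 0, 50), pos_high=(100, 100, 150))
```

In this example neither element score is positive, so the fallback activates the element with the highest score to keep the count of activated elements at least 1.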

Further, the parameter θ_k of the training policy network of the k-th agent is updated by the policy gradient method as follows:

where J(θ_k) is the policy objective function, F denotes the mini-batch size,

denotes the gradient operator,

is the policy learned by the k-th agent,

is the state of the k-th agent in the j-th experience sampled by the prioritized experience replay method, and

is the behavior of the k-th agent in the j-th experience;

The parameter ω_k1 of training value network 1 and the parameter ω_k2 of training value network 2 of the k-th agent are updated by gradient back-propagation through the neural network, and the loss functions are respectively:

where w_j is the importance sampling weight, and

represents the target Q value;
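The loss expressions are missing from the text. In the standard TD3 formulation, which is assumed here, the target Q value is the reward plus a discounted minimum of the two target value network outputs, and each squared TD error is scaled by its importance sampling weight w_j. The discount factor `gamma` does not appear in the text and is a hypothetical parameter.

```python
import numpy as np

def td3_targets(rewards, q1_next, q2_next, gamma=0.95):
    """Target Q value: reward plus discounted minimum of the two target critics."""
    rewards = np.asarray(rewards, dtype=float)
    return rewards + gamma * np.minimum(q1_next, q2_next)

def weighted_critic_loss(q_pred, targets, is_weights):
    """Importance-weighted mean squared TD error over a mini-batch of size F."""
    td_error = np.asarray(targets, dtype=float) - np.asarray(q_pred, dtype=float)
    loss = np.mean(np.asarray(is_weights, dtype=float) * td_error ** 2)
    return loss, td_error

y = td3_targets(rewards=[1.0, 2.0], q1_next=[10.0, 4.0], q2_next=[8.0, 6.0],
                gamma=0.5)
loss, td_error = weighted_critic_loss(q_pred=[4.0, 4.0], targets=y,
                                      is_weights=[1.0, 0.5])
```

Taking the minimum of the two target critics is what counters the overestimation of Q values, and the returned TD errors can be fed back as new priorities for the sampled experiences.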

The parameter θ'_k of the target policy network, the parameter ω'_k1 of target value network 1, and the parameter ω'_k2 of target value network 2 are each updated by soft updating:

θ′_k←αθ_k+(1-α)θ′_k

ω'_k1←αω_k1+(1-α)ω'_k1

ω'_k2←αω_k2+(1-α)ω'_k2

where α represents an update coefficient.
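The three soft updates above share one rule, which can be written as a single helper. The sketch represents network parameters as a list of NumPy arrays, which is an assumption about the data layout; the function name is hypothetical.

```python
import numpy as np

def soft_update(target_params, train_params, alpha):
    """Return alpha * training parameter + (1 - alpha) * target parameter."""
    return [alpha * w + (1 - alpha) * w_t
            for w, w_t in zip(train_params, target_params)]

theta_train = [np.ones(2)]     # stands in for theta_k, omega_k1 or omega_k2
theta_target = [np.zeros(2)]   # stands in for the matching target parameter
theta_target = soft_update(theta_target, theta_train, alpha=0.1)
```

A small α makes the target networks track the training networks slowly, which is the source of the stability benefit attributed to soft updating.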

Another technical solution of the present invention is a system for joint optimization of multiple unmanned aerial vehicle trajectories and intelligent reflecting surface phase shifts, comprising:

an energy module, configured to establish a wireless communication system model assisted by multiple unmanned aerial vehicles and intelligent reflecting surfaces, in which a signal sent by a user is reflected to a base station by the intelligent reflecting surface mounted on an unmanned aerial vehicle, to determine the channel model of the wireless communication system model and the energy consumption models of the unmanned aerial vehicles and the intelligent reflecting surfaces, and to calculate the energy efficiency EE_p of the wireless communication system model;

and an optimization module, configured to cluster the ground users with the K-means clustering algorithm based on the channel model and the energy consumption models determined by the energy module and, with energy efficiency as the optimization target, to use the prioritized experience replay MATD3 method to determine, in each cluster, the position of the unmanned aerial vehicle, the user that communicates with the base station with the assistance of the unmanned aerial vehicle and the intelligent reflecting surface, the activated reflecting elements of the intelligent reflecting surface, and the phase shifts of the activated reflecting elements, thereby completing the joint optimization of the multi-unmanned aerial vehicle trajectories and the intelligent reflecting surface phase shifts.

Compared with the prior art, the invention has at least the following beneficial effects:

In the method for jointly optimizing the trajectories of multiple unmanned aerial vehicles and the phase shifts of intelligent reflecting surfaces, a channel model and an energy consumption model are established to calculate the energy efficiency, and the neural networks are trained with energy efficiency maximization as the optimization target, so that they finally learn a policy that gives the wireless communication system maximum energy efficiency. Optimizing the unmanned aerial vehicle trajectories and intelligent reflecting surface phase shifts with the prioritized experience replay MATD3 method lets the unmanned aerial vehicles and intelligent reflecting surfaces adapt their policies to changes in the environment, which gives strong robustness.

Furthermore, the users are divided into several areas, and an unmanned aerial vehicle carrying an intelligent reflecting surface is deployed in each area to serve its users, which avoids the high power consumption and high communication delay caused by long-distance flight.

Further, in the decision stage, the unmanned aerial vehicle in each area selects the user to communicate with and the position for information transmission, and the intelligent reflecting surface selects the reflecting elements to activate and determines their phase shifts. In the flight stage, the unmanned aerial vehicle flies in a straight line to the information transmission position determined in the decision stage. In the information transmission stage, the user selected in the decision stage sends a signal, and the intelligent reflecting surface reflects the signal to the base station.

Furthermore, establishing a proper channel model is the basis for accurately calculating the information transmission rate, and the energy efficiency of the system can be further calculated after the information transmission rate is obtained.

Furthermore, the energy efficiency is used as an optimization target to design the track of the unmanned aerial vehicle and the phase shift of the intelligent reflecting surface, so that the aim of maximizing the energy efficiency of the system can be achieved.

Further, when the area that the unmanned aerial vehicles need to serve is very large, the users need to be clustered in order to improve communication quality and save unmanned aerial vehicle energy; each unmanned aerial vehicle carrying an intelligent reflecting surface serves the users in one cluster and flies within the coverage of that cluster.

Furthermore, the unmanned aerial vehicle and intelligent reflecting surface in each cluster together act as one agent, and the agents learn through centralized training and distributed execution, so that experience is shared and a policy that maximizes the energy efficiency of the system can be learned quickly. Sampling from the experience memory by prioritized experience replay lets experiences with higher learning value be learned more frequently, which improves learning efficiency. The TD3 algorithm mitigates the overestimation of Q values, so that the value networks can assess the value of state-behavior pairs accurately.

Further, the channel state depends on the positions of the user and the unmanned aerial vehicle and is an important basis for determining the best position of the unmanned aerial vehicle for information transmission and the phase shifts of the intelligent reflecting surface. Setting the state s_k to the position of the unmanned aerial vehicle at the previous moment together with the position of the user communicating with the base station lets the agent learn the hidden relationship between these positions and the channel state, so that the state s_k can be mapped directly to the behavior a_k that maximizes energy efficiency without acquiring accurate channel state information. By taking the position of the unmanned aerial vehicle and the activated-element matrix and phase shift matrix of the intelligent reflecting surface as the behavior a_k, the intelligent reflecting surface can establish a high-quality line-of-sight link between the user and the base station and reflect the signal sent by the user to the base station.

Furthermore, by computing the gradient of the policy objective function and adjusting the parameters of the training policy network to maximize the Q value, a policy that maps each state to the optimal behavior can be found. The parameters of the training value networks are updated by gradient descent to minimize the loss function, so that the value networks can evaluate state-behavior pairs accurately. Updating the parameters of the target policy network and the target value networks by soft updating improves the stability of the algorithm.

In summary, the invention uses multiple unmanned aerial vehicles and multiple intelligent reflecting surfaces to assist communication and clusters the users with the K-means clustering algorithm, each unmanned aerial vehicle and intelligent reflecting surface serving the users in one cluster. Through centralized training, the prioritized experience replay MATD3 method lets each agent learn the policies adopted by the other agents and shares the experience of all agents, so that the joint optimization of the multi-unmanned aerial vehicle trajectories and intelligent reflecting surface phase shifts is achieved quickly and the energy efficiency of the system is maximized.

The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.

Drawings

FIG. 1 is a diagram of a system model of the present invention;

FIG. 2 is a diagram illustrating a process of transmitting information from a user to a base station according to the present invention;

FIG. 3 is a flow chart of a K-means clustering algorithm;

FIG. 4 is a block diagram of the prioritized experience replay MATD3 method;

FIG. 5 is a diagram illustrating the effect of user transmit power on energy efficiency.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.

It should be further understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.

Various structural schematics according to the disclosed embodiments of the invention are shown in the drawings. The figures are not drawn to scale, wherein certain details are exaggerated and possibly omitted for clarity of presentation. The shapes of various regions, layers and their relative sizes and positional relationships shown in the drawings are merely exemplary, and deviations may occur in practice due to manufacturing tolerances or technical limitations, and a person skilled in the art may additionally design regions/layers having different shapes, sizes, relative positions, according to actual needs.

The invention provides a multi-unmanned aerial vehicle trajectory and intelligent reflecting surface phase shift joint optimization method, which first establishes a wireless communication system model assisted by multiple unmanned aerial vehicles and intelligent reflecting surfaces and then, in view of the non-convexity of the trajectory and phase shift optimization problem, provides a prioritized experience replay MATD3 (Multi-Agent Twin Delayed Deep Deterministic Policy Gradient) method to realize the joint optimization of the multi-unmanned aerial vehicle trajectories and intelligent reflecting surface phase shifts.

The invention discloses a multi-unmanned aerial vehicle track and intelligent reflecting surface phase shift joint optimization method, which comprises the following steps:

S1, establishing a wireless communication system model assisted by multiple unmanned aerial vehicles and intelligent reflecting surfaces, and then discussing in turn the channels and the energy consumed by the unmanned aerial vehicles and the intelligent reflecting surfaces;

As shown in FIG. 1, in the communication model U users are randomly distributed within a certain range and divided into K areas, with u_k users in area k, so that u₁+…+u_k+…+u_K = U. The numbers of intelligent reflecting surfaces and of unmanned aerial vehicles are both K, and each unmanned aerial vehicle carrying an intelligent reflecting surface serves the users in one area. The intelligent reflecting surface carried on the unmanned aerial vehicle adjusts the phase shifts of its M reflecting elements through an integrated controller. The base station receives the signals reflected by all the intelligent reflecting surfaces at the same time. Suppose that the base station has N antennas, each intelligent reflecting surface has M reflecting elements, and each user has a single antenna. Let the coordinates of the base station be (x_BS, y_BS, z_BS). The coordinates of the intelligent reflecting surface p are

The coordinates of the user q are

At any given moment, only one user in each area sends a signal. The signal sent by each user can be reflected to the base station both by the intelligent reflecting surface serving that area and by the intelligent reflecting surfaces serving the other areas, and the number of users participating in communication and the number of intelligent reflecting surfaces are both K. Each reflecting element of the intelligent reflecting surface can independently adjust the phase shift of the incident signal while keeping its amplitude unchanged. The phase shift matrix of the intelligent reflecting surface p is a diagonal matrix Θ_p = diag(ν_p), whose diagonal elements are

where θ_pm represents the phase shift of the m-th reflecting element of the intelligent reflecting surface p.

The activated-element matrix of the intelligent reflecting surface is also a diagonal matrix, Δ_p = diag(υ_p), whose diagonal elements are

υ_p＝(δ_p1,…,δ_pm,…,δ_pM) (2)

where δ_pm indicates whether the m-th reflecting element of the intelligent reflecting surface p is activated,

Referring to FIG. 2, the process of transmitting information from the user to the base station is divided into three stages, specifically:

1) Decision stage: the unmanned aerial vehicle selects the user to communicate with and the position for information transmission, and the intelligent reflecting surface selects the reflecting elements to activate and their phase shifts.

2) Flight stage: the unmanned aerial vehicle flies in a straight line at speed v to the information transmission position selected in the decision stage.

3) Information transmission stage: after reaching the specified position, the unmanned aerial vehicle hovers there, the user selected in the decision stage sends a signal to the intelligent reflecting surface, and the activated reflecting elements of the intelligent reflecting surface reflect the signal to the base station with a certain phase shift.

The channels between the user and the intelligent reflecting surface and between the intelligent reflecting surface and the base station are modeled as Rician channels. Let the channel from user q to intelligent reflecting surface p be G_pq.

Specifically:

where ρ denotes the path loss at the reference distance d₀ = 1 m, k₁ is the path loss exponent, β is the Rician factor,

is the Euclidean distance between user q and intelligent reflecting surface p,

is the non-line-of-sight propagation component, each element of which is modeled as a circularly symmetric complex Gaussian variable with zero mean and unit variance,

is the array response vector,

is the cosine of the angle of arrival of the signal from user q at intelligent reflecting surface p.

The channel from the intelligent reflecting surface p to the base station is

Specifically:

where

is the Euclidean distance between intelligent reflecting surface p and the base station,

is an array response vector; specifically,

and

can be expressed as:

where

and

are the cosines of the angle of departure and the angle of arrival of the signal, respectively.
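As a rough illustration of the Rician model described above, the sketch below combines a line-of-sight array response with a CN(0, 1) non-line-of-sight term. The channel expressions themselves are not reproduced in this text, so the combining rule, the uniform-linear-array response, the half-wavelength spacing, and the way the angle cosine is computed are all assumptions based on the standard textbook form; the function names are hypothetical.

```python
import numpy as np

def array_response(num_elements, cos_angle, spacing_over_wavelength=0.5):
    """Uniform-linear-array response vector (assumed form, d / lambda = 0.5 by default)."""
    idx = np.arange(num_elements)
    return np.exp(-1j * 2 * np.pi * spacing_over_wavelength * idx * cos_angle)

def rician_channel(los, distance, rho, k1, beta, rng):
    """Assumed standard Rician form: path gain times a weighted LoS plus NLoS sum."""
    los = np.asarray(los, dtype=complex)
    # NLoS entries: circularly symmetric complex Gaussian, zero mean, unit variance.
    nlos = (rng.standard_normal(los.shape) + 1j * rng.standard_normal(los.shape)) / np.sqrt(2)
    path_gain = np.sqrt(rho * distance ** (-k1))
    return path_gain * (np.sqrt(beta / (1 + beta)) * los + np.sqrt(1 / (1 + beta)) * nlos)

rng = np.random.default_rng(0)
M = 8                                      # reflecting elements (illustrative)
irs_pos = np.array([0.0, 0.0, 50.0])       # illustrative coordinates in metres
user_pos = np.array([30.0, 40.0, 0.0])
d1 = np.linalg.norm(irs_pos - user_pos)    # Euclidean distance user q to surface p
cos_aoa = (user_pos[0] - irs_pos[0]) / d1  # assumed definition of the angle cosine
g_pq = rician_channel(array_response(M, cos_aoa), d1, rho=1e-3, k1=2.2, beta=10.0, rng=rng)
```

With a very large β the non-line-of-sight term vanishes and every entry of the channel has magnitude equal to the path gain, which is a convenient sanity check.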

Let S be the transmit signal matrix, H the channel matrix, and s_k the signal transmitted by user k (k ∈ {1, …, K}). The received signal at the base station is then:

y = HS + n = h_1 s_1 + … + h_K s_K + n

where h_k is the k-th column of the matrix H, s_k is the k-th row of the matrix S, and n is the additive white Gaussian noise at the base station, a circularly symmetric complex Gaussian variable with zero mean and variance σ².

In an uplink multi-user communication system, multiple users transmit on the same frequency band at the same time, so co-channel interference exists. To suppress co-channel interference between users and successfully detect the signal transmitted by each user, the base station may use a zero-forcing detection algorithm, which eliminates the interference between the signals of different transmit antennas through a linear transformation at the receiver.

To recover s_k at the base station while excluding interference from the signals of other users, the received signal y is multiplied by the matrix W_ZF to obtain the equalized signal, i.e.

W_ZF y = W_ZF H S + W_ZF n (10)

Let w_k be the k-th row of the matrix W_ZF; it should satisfy:

The matrix W_ZF should therefore satisfy W_ZF H = I, where I is the identity matrix; specifically:

W_ZF = (H^H H)^-1 H^H (12)

Assuming that the channel matrix H is full rank, the estimate of the transmitted signal

can be expressed as:

This is the estimate of the transmitted signal after the zero-forcing detector has completely eliminated the interference between the signals transmitted by different users.
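Zero-forcing detection with the matrix of equation (12) can be checked numerically. The sketch below uses a random channel, BPSK symbols, and illustrative sizes and noise variance, none of which come from the source.

```python
import numpy as np

rng = np.random.default_rng(1)
N, K, T = 8, 3, 5   # base station antennas, users, symbols per user (illustrative)

# Random full-column-rank channel; column k is h_k.
H = (rng.standard_normal((N, K)) + 1j * rng.standard_normal((N, K))) / np.sqrt(2)
S = rng.choice([-1.0, 1.0], size=(K, T)).astype(complex)   # row k holds s_k
sigma2 = 1e-4
noise = np.sqrt(sigma2 / 2) * (rng.standard_normal((N, T)) + 1j * rng.standard_normal((N, T)))

y = H @ S + noise                                    # received signal at the base station
W_zf = np.linalg.solve(H.conj().T @ H, H.conj().T)   # (H^H H)^-1 H^H, equation (12)
S_hat = W_zf @ y                                     # equalized signal, equation (10)
```

Solving the linear system with `np.linalg.solve` avoids forming the explicit inverse of H^H H, which is usually the numerically safer choice.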

Treating the interference from other users as noise, the signal-to-interference-plus-noise ratio of the k-th user is:

The information transmission rate of the k-th user is:

is the conjugate transpose of the channel matrix between intelligent reflecting surface p and the base station, Θ_p is the phase shift matrix of intelligent reflecting surface p, Δ_p is the activated-element matrix of intelligent reflecting surface p, G_pq is the channel between user q and intelligent reflecting surface p, G_pk is the channel between user k and intelligent reflecting surface p, and σ² is the noise variance.

Energy consumption in the multi-user uplink transmission system assisted by unmanned aerial vehicles and intelligent reflecting surfaces has two parts: the energy consumed by the flight of the unmanned aerial vehicle and the energy consumed by the activated reflecting elements of the intelligent reflecting surface. The propulsion power of the p-th unmanned aerial vehicle is:

where U_tip is the rotor blade tip speed of the unmanned aerial vehicle, v₀ is the mean rotor induced velocity in hover, χ is the fuselage drag ratio, κ is the air density, u is the rotor solidity, a is the rotor disc area,

is the profile drag coefficient, Ω is the blade angular velocity, γ is the rotor radius, ψ is the incremental correction factor for induced power, W is the weight of the unmanned aerial vehicle, and v_p is the speed of the p-th unmanned aerial vehicle, calculated as follows.

The position of unmanned aerial vehicle p at time (t-1) is

and its position at time t is

The distance that unmanned aerial vehicle p flies from its position at time (t-1) to its position at time t is:

If the flight takes time T, the speed v_p of the p-th unmanned aerial vehicle is:

The energy consumed when unmanned aerial vehicle p flies to the specified position is:

Let δ_pm indicate whether the m-th reflecting element of intelligent reflecting surface p is activated, and let p_IRS denote the power consumed by each reflecting element. The power consumed by the entire intelligent reflecting surface p is:

The duration of the information transmission stage is τ, and the energy consumed by intelligent reflecting surface p during this period is:

In the information transmission stage, in the k-th cluster, user p is assisted by unmanned aerial vehicle p and intelligent reflecting surface p, and the amount of data transmitted to the base station is:

G_p = R_p τ (24)

Energy efficiency is the amount of data transmitted divided by the total energy consumed by unmanned aerial vehicle p and intelligent reflecting surface p:
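The energy model can be sketched as below. The propulsion power expression is not reproduced in this text; the code assumes a rotary-wing propulsion power model that is common in the UAV communication literature and uses the same set of symbols, so whether it is identical to the patent's expression is an assumption. The default parameter values are illustrative, the name `profile_drag` stands in for the profile drag coefficient whose symbol is missing, and flight energy is assumed to be constant propulsion power multiplied by the flight time T.

```python
import math

def propulsion_power(v, *, U_tip=120.0, v0=4.03, chi=0.6, kappa=1.225, u=0.05,
                     a=0.503, profile_drag=0.012, omega=300.0, gamma=0.4,
                     psi=0.1, W=20.0):
    """Rotary-wing propulsion power in watts at speed v (assumed model and defaults)."""
    p_blade = profile_drag / 8 * kappa * u * a * omega ** 3 * gamma ** 3  # hover blade profile power
    p_induced = (1 + psi) * W ** 1.5 / math.sqrt(2 * kappa * a)           # hover induced power
    blade = p_blade * (1 + 3 * v ** 2 / U_tip ** 2)
    induced = p_induced * math.sqrt(
        math.sqrt(1 + v ** 4 / (4 * v0 ** 4)) - v ** 2 / (2 * v0 ** 2))
    parasite = 0.5 * chi * kappa * u * a * v ** 3
    return blade + induced + parasite

def energy_efficiency(pos_prev, pos_now, T, rate, tau, active, p_irs):
    """Transmitted data divided by flight energy plus reflecting-surface energy."""
    dist = math.dist(pos_prev, pos_now)   # distance flown between time t-1 and time t
    v_p = dist / T                        # speed of the unmanned aerial vehicle
    e_fly = propulsion_power(v_p) * T     # assumed: constant power over the flight
    e_irs = p_irs * sum(active) * tau     # per-element power times active elements times duration
    data = rate * tau                     # equation (24): G_p = R_p * tau
    return data / (e_fly + e_irs)

ee = energy_efficiency((0, 0, 50), (30, 40, 50), T=5.0, rate=1e6, tau=1.0,
                       active=[1, 0, 1, 1], p_irs=0.01)
```

Under this assumed model, moderate forward speed needs less power than hovering, so the chosen flight time T affects the energy efficiency in a non-monotonic way.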

S2. Based on the channel model and the energy consumption model of step S1, the ground users are clustered using the K-means clustering algorithm. The prioritized experience replay MATD3 method is then used to determine, for each cluster, the position of the unmanned aerial vehicle, the user that communicates with the base station with the assistance of the unmanned aerial vehicle and the intelligent reflecting surface in the information transmission stage, the activated reflecting elements of the intelligent reflecting surface, and the phase shifts of those elements, thereby completing the joint optimization of the unmanned aerial vehicle trajectories and the intelligent reflecting surface phase shifts.

Referring to FIG. 3, the basic idea of the K-means clustering algorithm is as follows. A value of K is first specified, and K users are drawn at random from all users as the initial cluster centers. The distances between each remaining user and the K initial cluster centers are then calculated, and each user is assigned to the cluster whose center is closest. For each newly formed cluster, the cluster center is recomputed as the mean of the samples in the cluster, and the assignment step is repeated. When the cluster centers of all clusters are identical to those obtained in the previous iteration, the clustering criterion function has converged and all users have been assigned to their clusters.
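The clustering procedure just described can be sketched in a few lines of NumPy. The function name `kmeans_users`, the iteration cap, and the handling of an empty cluster are illustrative choices, not details taken from the source.

```python
import numpy as np

def kmeans_users(positions, K, rng, max_iter=100):
    """Cluster user positions into K clusters; returns (labels, centers)."""
    positions = np.asarray(positions, dtype=float)
    # Draw K users at random as the initial cluster centers.
    centers = positions[rng.choice(len(positions), size=K, replace=False)]
    for _ in range(max_iter):
        # Assign every user to the nearest cluster center.
        dists = np.linalg.norm(positions[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each center as the mean of its members (keep it if the cluster is empty).
        new_centers = np.array([
            positions[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(K)
        ])
        if np.allclose(new_centers, centers):   # centers unchanged: converged
            break
        centers = new_centers
    return labels, centers

rng = np.random.default_rng(0)
users = np.vstack([rng.normal([0, 0], 1.0, size=(20, 2)),
                   rng.normal([100, 100], 1.0, size=(20, 2))])
labels, centers = kmeans_users(users, K=2, rng=rng)
```

Each resulting cluster would then be assigned one unmanned aerial vehicle carrying an intelligent reflecting surface.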

After the ground users have been divided into clusters by the K-means clustering algorithm, an unmanned aerial vehicle carrying an intelligent reflecting surface is placed in each cluster and flies within the coverage of that cluster to serve its users. The trajectories of the multiple unmanned aerial vehicles and the phase shifts of the intelligent reflecting surfaces are jointly optimized using the prioritized experience replay MATD3 algorithm to maximize the energy efficiency of the system; the algorithm framework is shown in FIG. 4. The optimization of unmanned aerial vehicle trajectories and intelligent reflecting surface phase shifts in the wireless communication system assisted by multiple unmanned aerial vehicles and intelligent reflecting surfaces is modeled as a Markov game, in which each unmanned aerial vehicle carrying an intelligent reflecting surface acts as an agent. The k-th agent observes the current environment state s_k, selects an action a_k according to its policy π_k, and obtains a reward r_k after the action is applied to the environment; the environment then transitions to a new state s'_k with transition probability P(s'_k | s_k, a_1, …, a_K).

The state s_k observed by each unmanned aerial vehicle consists of two parts: the position of unmanned aerial vehicle k (k = 1, 2, …, K) at the previous moment and the position of the user in the k-th cluster that communicates with the base station.

The state s_k is six-dimensional:

The action a_k of the k-th agent is a vector of dimension (3 + u_k + 2 × M), where u_k is the number of users in the k-th cluster. The action a_k consists of the following four parts:

i: the position of the k-th unmanned aerial vehicle at the current moment

ii: the activated user vector of the k-th cluster at the current moment. Each element indicates whether the corresponding user is activated: a value of 0 means the user is not activated and a value of 1 means the user is activated. The vector

should satisfy

which means that only one user in a cluster is activated at any moment;

iii: the activated element vector of the intelligent reflecting surface at the current moment. Each element indicates whether the corresponding reflecting element is activated: a value of 0 means the element is not activated and a value of 1 means the element is activated. The vector

should satisfy

iv: the phase shift vector of the intelligent reflecting surface at the current moment

Each element represents the phase shift of the corresponding reflecting element,

The reward r_k(s_k, a_k) is defined as the energy efficiency EE_k, calculated from equation (25).
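The layout of the (3 + u_k + 2 × M)-dimensional action vector can be made concrete with a small decoding sketch. How the policy output is mapped onto the constraints is not specified in this text, so the argmax selection of the single activated user, the zero threshold for element activation, and the phase range [0, 2π) are assumptions; the lower bound of one activated element follows the range of 1 to M stated in the claims. The function name `decode_action` is hypothetical.

```python
import numpy as np

def decode_action(raw, u_k, M):
    """Split a raw policy output of length 3 + u_k + 2*M into the four action parts."""
    raw = np.asarray(raw, dtype=float)
    if raw.shape != (3 + u_k + 2 * M,):
        raise ValueError("unexpected action dimension")
    position = raw[:3]                          # part i: position of the vehicle
    user_scores = raw[3:3 + u_k]
    elem_scores = raw[3 + u_k:3 + u_k + M]
    phase_raw = raw[3 + u_k + M:]

    user = np.zeros(u_k, dtype=int)             # part ii: exactly one activated user
    user[np.argmax(user_scores)] = 1

    active = (elem_scores > 0).astype(int)      # part iii: assumed threshold at zero
    if active.sum() == 0:
        active[np.argmax(elem_scores)] = 1      # keep at least one element activated

    phases = np.mod(phase_raw, 2 * np.pi)       # part iv: assumed phase range [0, 2*pi)
    return position, user, active, phases

raw = np.array([10.0, 20.0, 50.0, 0.1, 0.9, 0.3, -1.0, -2.0, 7.0, 0.5])
pos, user, active, phases = decode_action(raw, u_k=3, M=2)
```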

In the multi-agent system, each agent has six neural networks: a training policy network, a target policy network, a first training value network, a second training value network, a first target value network, and a second target value network. At each moment, the k-th agent observes the position of unmanned aerial vehicle k at the previous moment and the position of the user in the k-th cluster that communicates with the base station as its state s_k. The training policy network, with parameter θ_k, takes the state s_k as input and outputs the action a_k, which comprises the position of the k-th unmanned aerial vehicle at the current moment, the activated user vector of the k-th cluster, and the activated element vector and phase shift vector of the k-th intelligent reflecting surface. The first and second training value networks, with parameters ω_k1 and ω_k2 respectively, take the joint state s = (s_1, s_2, …, s_K) observed by all agents and the joint action a = (a_1, a_2, …, a_K) as input and output the joint state-action value functions Q_k1(s, a_1, a_2, …, a_K, ω_k1) and Q_k2(s, a_1, a_2, …, a_K, ω_k2), respectively. The target policy network takes the next state s'_k as input and outputs the next action a'_k; its parameter θ'_k is updated from the parameter θ_k of the training policy network by soft update. The first and second target value networks take the next state-action pair (s', a') as input and output Q'_k1(s', a'_1, a'_2, …, a'_K, ω'_k1) and Q'_k2(s', a'_1, a'_2, …, a'_K, ω'_k2), respectively; their parameters ω'_k1 and ω'_k2 are updated from the parameters ω_k1 and ω_k2 of the first and second training value networks by soft update.

The probability that experience j is sampled is:

where γ controls how strongly prioritization is applied, F is the number of experiences drawn in a mini-batch, D_j = 1/rank(j), and rank(j) is the rank of the j-th experience by learning value.

The importance sampling weights are:

where E is the number of experiences stored in the experience memory and ξ is the sampling weight coefficient.
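Rank-based prioritized sampling with D_j = 1/rank(j) can be sketched as follows. The sampling probability and importance weight expressions are not reproduced in this text, so the normalization over all stored experiences, the weight form (1 / (E · P(j)))^ξ, and the division by the largest weight follow the usual rank-based prioritized replay formulation and are assumptions here; the function name is hypothetical.

```python
import numpy as np

def rank_based_sampling(learning_values, gamma, xi, batch_size, rng):
    """Sample a mini-batch of experience indices with rank-based priorities."""
    values = np.asarray(learning_values, dtype=float)
    E = len(values)                             # number of stored experiences
    order = np.argsort(-values)                 # indices from highest to lowest learning value
    rank = np.empty(E, dtype=int)
    rank[order] = np.arange(1, E + 1)           # rank 1 = highest learning value
    D = 1.0 / rank                              # D_j = 1 / rank(j)
    probs = D ** gamma / np.sum(D ** gamma)     # assumed normalization over stored experiences
    idx = rng.choice(E, size=batch_size, p=probs)
    weights = (1.0 / (E * probs[idx])) ** xi    # assumed importance sampling weight form
    weights /= weights.max()                    # common stabilizing normalization (assumption)
    return idx, weights, probs

rng = np.random.default_rng(0)
idx, w, probs = rank_based_sampling([0.1, 2.0, 0.5, 1.0], gamma=0.7, xi=0.5,
                                    batch_size=3, rng=rng)
```

Setting γ to 0 in this sketch recovers uniform sampling, which makes the effect of the priority exponent easy to compare.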

The parameter θ_k of the training policy network of the k-th agent is updated using the policy gradient method:

where J(θ_k) is the policy objective function, ∇ denotes the gradient operator,

is the policy learned by the k-th agent,

is the action of the k-th agent in the j-th experience.

where w_j is the importance sampling weight,

represents the target Q value.

The loss function represents the difference between the Q value output by the training value network and the target Q value. Updating the parameters of the training value network by gradient descent to minimize the loss function brings the Q value output by the training value network close to the target Q value, so that the training value network can accurately evaluate the value of a state-action pair.
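A minimal numerical sketch of the target Q value and the importance-weighted loss is given below. The exact expressions are not reproduced in this text; the sketch assumes the usual TD3 construction, in which the target uses the smaller of the two target value network outputs, and a weighted mean squared error. The `discount` parameter and the function names are assumptions.

```python
import numpy as np

def td3_targets(rewards, q1_next, q2_next, discount):
    """Target Q value from the smaller of the two target value network outputs (assumed form)."""
    return np.asarray(rewards) + discount * np.minimum(q1_next, q2_next)

def weighted_value_loss(q_pred, targets, weights):
    """Importance-weighted mean squared error; also returns absolute errors for re-ranking."""
    td_error = np.asarray(targets) - np.asarray(q_pred)
    return float(np.mean(np.asarray(weights) * td_error ** 2)), np.abs(td_error)

y = td3_targets([1.0, 0.0], [2.0, 5.0], [3.0, 4.0], discount=0.9)
loss, abs_err = weighted_value_loss([2.8, 3.0], y, [1.0, 0.5])
```

Taking the minimum of the two target outputs is what counteracts Q value overestimation in TD3-style methods; the absolute errors could serve as learning values when experiences are re-ranked.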

The parameter θ'_k of the target policy network, the parameter ω'_k1 of the first target value network, and the parameter ω'_k2 of the second target value network are each updated by soft update:

θ′_k←αθ_k+(1-α)θ′_k (34)

ω'_k1←αω_k1+(1-α)ω'_k1 (35)

ω'_k2←αω_k2+(1-α)ω'_k2 (36)

where α represents an update coefficient.
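Equations (34) to (36) apply the same rule to each target network. A minimal sketch, assuming the network parameters are held as lists of NumPy arrays (the function name and example values are illustrative):

```python
import numpy as np

def soft_update(target_params, train_params, alpha):
    """Return alpha * training + (1 - alpha) * target for each parameter array."""
    return [alpha * w + (1.0 - alpha) * w_t
            for w_t, w in zip(target_params, train_params)]

theta_k = [np.ones((2, 2)), np.full(2, 4.0)]        # training policy network parameters
theta_k_target = [np.zeros((2, 2)), np.zeros(2)]    # target policy network parameters
theta_k_target = soft_update(theta_k_target, theta_k, alpha=0.01)
```

A small α makes the target networks track the training networks slowly, which stabilizes the target Q value.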

In another embodiment of the present invention, a joint optimization system for multi-unmanned aerial vehicle trajectories and intelligent reflecting surface phase shifts is provided. The system can be used to implement the joint optimization method described above and comprises an energy module and an optimization module.

The energy module establishes a wireless communication system model assisted by multiple unmanned aerial vehicles and intelligent reflecting surfaces, in which the signal sent by a user is reflected to the base station by the intelligent reflecting surface installed on the unmanned aerial vehicle; it determines the channel model in the wireless communication system model and the energy consumption models of the unmanned aerial vehicle and the intelligent reflecting surface, and calculates the energy efficiency EE_p of the wireless communication system model.

The optimization module clusters the ground users using the K-means clustering algorithm, based on the channel model and the energy consumption models of the unmanned aerial vehicle and the intelligent reflecting surface determined by the energy module. It then uses the prioritized experience replay MATD3 method to determine, for each cluster, the position of the unmanned aerial vehicle, the user that communicates with the base station with the assistance of the unmanned aerial vehicle and the intelligent reflecting surface, the activated reflecting elements of the intelligent reflecting surface, and the phase shifts of the activated reflecting elements, thereby completing the joint optimization of the trajectories of the multiple unmanned aerial vehicles and the phase shifts of the intelligent reflecting surfaces.

In yet another embodiment of the present invention, a terminal device is provided that includes a processor and a memory. The memory stores a computer program comprising program instructions, and the processor executes the program instructions stored in the computer storage medium. The processor may be a Central Processing Unit (CPU) or another general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. It is the computing core and control core of the terminal and is adapted to load and execute one or more instructions to implement the corresponding method flow or function. The processor provided by this embodiment can be used to carry out the joint optimization method for multi-unmanned aerial vehicle trajectories and intelligent reflecting surface phase shifts, which comprises the following steps:

establishing a wireless communication system model assisted by multiple unmanned aerial vehicles and intelligent reflecting surfaces, in which the signal sent by a user is reflected to the base station by the intelligent reflecting surface installed on the unmanned aerial vehicle; determining the channel model in the wireless communication system model and the energy consumption models of the unmanned aerial vehicle and the intelligent reflecting surface; calculating the energy efficiency of the wireless communication system model; clustering the ground users using the K-means clustering algorithm based on the determined channel model and energy consumption models; and, with energy efficiency as the optimization target, using the prioritized experience replay MATD3 method to determine, for each cluster, the position of the unmanned aerial vehicle, the user that communicates with the base station with the assistance of the unmanned aerial vehicle and the intelligent reflecting surface, the activated reflecting elements of the intelligent reflecting surface, and the phase shifts of the activated reflecting elements, thereby completing the joint optimization of the trajectories of the multiple unmanned aerial vehicles and the phase shifts of the intelligent reflecting surfaces.

In still another embodiment, the present invention further provides a storage medium, specifically a computer-readable storage medium (memory), which is a memory device in the terminal device used for storing programs and data. The computer-readable storage medium here may include a built-in storage medium of the terminal device and may also include an extended storage medium supported by the terminal device. The computer-readable storage medium provides a storage space that stores the operating system of the terminal. One or more instructions, which may be one or more computer programs (including program code), are also stored in the storage space and are adapted to be loaded and executed by the processor. The computer-readable storage medium may be a high-speed RAM or a non-volatile memory, such as at least one disk memory.

The processor can load and execute the one or more instructions stored in the computer-readable storage medium to implement the corresponding steps of the joint optimization method for multi-unmanned aerial vehicle trajectories and intelligent reflecting surface phase shifts in the above embodiment; the one or more instructions in the computer-readable storage medium are loaded by the processor to perform the following steps:

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The joint optimization algorithm for multi-unmanned aerial vehicle trajectories and intelligent reflecting surface phase shifts based on prioritized experience replay MATD3 is summarized as follows:

The simulation parameters are set as follows:

Referring to FIG. 5, the energy efficiency of the system is shown as a function of user transmit power for the MADDPG method, the MATD3 method, the prioritized experience replay MADDPG method, and the prioritized experience replay MATD3 method. As the figure shows, the energy efficiency of the system is higher with prioritized experience replay than without it, and higher with the MATD3 method than with the MADDPG method. With prioritized experience replay, experiences of higher learning value in the experience memory are sampled with higher probability, and learning from these experiences improves learning efficiency; the MATD3 method overcomes the overestimation of the Q value, so the value network evaluates the value of state-action pairs more accurately. In addition, as the user transmit power increases, the amount of data transmitted increases, and the energy efficiency of the system therefore increases.

In summary, the joint optimization method and system for multi-unmanned aerial vehicle trajectories and intelligent reflecting surface phase shifts of the present invention consider an uplink wireless communication system assisted by multiple unmanned aerial vehicles and intelligent reflecting surfaces. The ground users are first clustered, and an unmanned aerial vehicle carrying an intelligent reflecting surface is assigned to each cluster to serve the users in that cluster. The joint optimization of the unmanned aerial vehicle trajectory and the intelligent reflecting surface phase shift in each cluster is then completed using the prioritized experience replay MATD3 method, so as to maximize the energy efficiency of the system.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

The above-mentioned contents are only for illustrating the technical idea of the present invention, and the protection scope of the present invention is not limited thereby, and any modification made on the basis of the technical idea of the present invention falls within the protection scope of the claims of the present invention.

Claims

1. A multi-unmanned aerial vehicle track and intelligent reflecting surface phase shift joint optimization method is characterized by comprising the following steps:

2. The method according to claim 1, wherein in step S1, the wireless communication system model assisted by multiple unmanned aerial vehicles and intelligent reflecting surfaces is specifically: the number of randomly distributed users is U, the users are divided into K regions, and the number of users in the k-th region is u_k, with u_1 + … + u_k + … + u_K = U; the numbers of intelligent reflecting surfaces and unmanned aerial vehicles are both K, and each unmanned aerial vehicle carrying an intelligent reflecting surface serves the users in one region; the intelligent reflecting surface carried on the unmanned aerial vehicle adjusts the phase shifts of its M reflecting elements through an integrated controller; the base station simultaneously receives the signals reflected by all intelligent reflecting surfaces; the base station has N antennas, each intelligent reflecting surface has M reflecting elements, and each user has a single antenna; the coordinates of the base station are (x_BS, y_BS, z_BS), and the coordinates of the intelligent reflecting surface p are

The coordinates of the user q are

Only one user in each region transmits a signal at a time; the signal transmitted by each user is reflected to the base station by the intelligent reflecting surface serving its region and by the intelligent reflecting surfaces serving other regions, and the number of users participating in communication and the number of intelligent reflecting surfaces are both K; each reflecting element of the intelligent reflecting surface independently adjusts the phase shift of the incident signal while keeping its amplitude unchanged, and the phase shift matrix of intelligent reflecting surface p is the diagonal matrix Θ_p = diag(ν_p), whose diagonal elements are

3. The method according to claim 1, wherein in step S1, the process in which the signal sent by the user is reflected to the base station by the intelligent reflecting surface on the unmanned aerial vehicle comprises a decision stage, a flight stage and an information transmission stage; decision stage: the unmanned aerial vehicle selects which user to communicate with and the position for information transmission, and the intelligent reflecting surface selects the reflecting elements to activate and their phase shifts; flight stage: the unmanned aerial vehicle flies in a straight line at speed v to the information transmission position selected in the decision stage; information transmission stage: the unmanned aerial vehicle hovers after reaching the specified position, the user selected in the decision stage sends a signal to the intelligent reflecting surface, and the activated reflecting elements of the intelligent reflecting surface reflect the signal sent by the user to the base station with the corresponding phase shifts.

4. The method of claim 1, wherein in step S1, the channels between the user and the intelligent reflecting surface and between the intelligent reflecting surface and the base station are modeled as Rician channels, and the channel G_pq from user q to intelligent reflecting surface p is:

where ρ denotes the path loss at the reference distance d₀ = 1 m, k₁ is the path loss exponent, β is the Rician factor, d₁ is the Euclidean distance between user q and intelligent reflecting surface p, G_pq is a non-line-of-sight propagation component,

is the array response vector,

is the cosine of the angle of arrival of the signal from user q at intelligent reflecting surface p, λ is the carrier wavelength, and d is the antenna spacing;

where d₂ is the Euclidean distance between intelligent reflecting surface p and the base station, F_p is a non-line-of-sight propagation component,

and

are array response vectors;

the received signal y of the base station is:

where S is the transmit signal matrix, H is the channel matrix, h_k is the k-th column of the matrix H, s_k is the k-th row of the matrix S, and n is the additive white Gaussian noise at the base station, a circularly symmetric complex Gaussian variable with variance σ²;

the information transmission rate R_k of the k-th user is:

5. The method according to claim 1, wherein in step S1, the energy efficiency EE_p is the amount of data transmitted divided by the total energy consumed by unmanned aerial vehicle p and intelligent reflecting surface p, specifically:

where

is the energy consumed by the unmanned aerial vehicle in flying to the specified position, G_p is the amount of data transmitted to the base station by user p with the assistance of unmanned aerial vehicle p and intelligent reflecting surface p,

is the energy consumed by intelligent reflecting surface p,

6. The method according to claim 1, wherein in step S2, the users are clustered using a K-means clustering algorithm, specifically:

7. The method of claim 1, wherein in step S2, the position of the unmanned aerial vehicle in each cluster, the user communicating with the base station, the activated reflecting elements of the intelligent reflecting surface, and the phase shifts of the activated elements are determined using the prioritized experience replay MATD3 method, and the joint optimization of the trajectories of the multiple unmanned aerial vehicles and the phase shifts of the intelligent reflecting surfaces is specifically performed as follows:

the optimization of unmanned aerial vehicle trajectories and intelligent reflecting surface phase shifts in the wireless communication system assisted by multiple unmanned aerial vehicles and intelligent reflecting surfaces is modeled as a Markov game, in which each unmanned aerial vehicle carrying an intelligent reflecting surface acts as an agent; the k-th agent observes the current environment state s_k, selects an action a_k according to its policy π_k, and obtains a reward r_k after the action is applied to the environment; the environment then transitions to a new state s'_k with transition probability P(s'_k | s_k, a_1, …, a_K);

at each moment, the k-th agent observes the position of unmanned aerial vehicle k at the previous moment and the position of the user in the k-th cluster communicating with the base station as its state s_k; the training policy network, with parameter θ_k, takes the state s_k as input and outputs the action a_k, which comprises the position of the k-th unmanned aerial vehicle at the current moment, the activated user vector of the k-th cluster, and the activated element vector and phase shift vector of the k-th intelligent reflecting surface; the first and second training value networks, with parameters ω_k1 and ω_k2 respectively, take the joint state s = (s_1, s_2, …, s_K) observed by all agents and the joint action a = (a_1, a_2, …, a_K) as input and output the joint state-action value functions Q_k1(s, a_1, a_2, …, a_K, ω_k1) and Q_k2(s, a_1, a_2, …, a_K, ω_k2), respectively; the target policy network takes the next state s'_k as input and outputs the next action a'_k, and its parameter θ'_k is updated from the parameter θ_k of the training policy network by soft update; the first and second target value networks take the next state-action pair (s', a') as input and output Q'_k1(s', a'_1, a'_2, …, a'_K, ω'_k1) and Q'_k2(s', a'_1, a'_2, …, a'_K, ω'_k2), respectively, and their parameters ω'_k1 and ω'_k2 are updated from the parameters ω_k1 and ω_k2 of the first and second training value networks by soft update;

8. The method of claim 7, wherein the state s_k observed by each unmanned aerial vehicle comprises two parts: the position of unmanned aerial vehicle k (k = 1, 2, …, K) at the previous moment and the position of the user in the k-th cluster that communicates with the base station;

the state s_k is six-dimensional; the action a_k comprises the following four parts:

i: the position of the k-th unmanned aerial vehicle at the current moment

ii: the activated user vector of the k-th cluster at the current moment, which should satisfy

which means that only one user in a cluster is activated at any moment;

iii: the activated element vector of the intelligent reflecting surface at the current moment, which should satisfy

that is, the number of activated elements of each intelligent reflecting surface is between 1 and M;

iv: the phase shift vector of the intelligent reflecting surface at the current moment

Each element represents the phase shift of the corresponding reflecting element,

the reward is defined as the energy efficiency EE_k: r_k(s_k, a_k) = EE_k.

9. The method of claim 7, wherein the parameter θ_k of the training policy network of the k-th agent is updated using the policy gradient method as follows:

where ∇ denotes the gradient operator,

is the policy learned by the k-th agent,

is the action of the k-th agent in the j-th experience;

where w_j is the importance sampling weight,

represents the target Q value;

θ′_k←αθ_k+(1-α)θ′_k

ω′_k1←αω_k1+(1-α)ω′_k1

ω′_k2←αω_k2+(1-α)ω′_k2

where α represents an update coefficient.

10. A multi-unmanned aerial vehicle track and intelligent reflecting surface phase shift joint optimization system, characterized by comprising: