CN109743210B - Unmanned aerial vehicle network multi-user access control method based on deep reinforcement learning - Google Patents

Unmanned aerial vehicle network multi-user access control method based on deep reinforcement learning

Info

Publication number
CN109743210B
Authority
CN
China
Prior art keywords
unmanned aerial
aerial vehicle
access
base station
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910074944.6A
Other languages
Chinese (zh)
Other versions
CN109743210A (en)
Inventor
梁应敞 (Ying-Chang Liang)
曹阳 (Yang Cao)
张蔺 (Lin Zhang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN201910074944.6A priority Critical patent/CN109743210B/en
Publication of CN109743210A publication Critical patent/CN109743210A/en
Application granted granted Critical
Publication of CN109743210B publication Critical patent/CN109743210B/en



Landscapes

  • Mobile Radio Communication Systems (AREA)

Abstract

The invention belongs to the technical field of wireless communication and relates to an unmanned aerial vehicle network multi-user access control method based on deep reinforcement learning. The invention exploits the inherent variation patterns of the environment to provide a deep reinforcement learning framework adapted to multi-user access in an unmanned aerial vehicle network, and realizes a deep-reinforcement-learning-based multi-user access control scheme for the unmanned aerial vehicle network without knowledge of the global network information. Compared with conventional access control, the proposed scheme achieves higher system throughput and fewer handovers. Moreover, different trade-offs between throughput and the number of handovers can be realized by adjusting the handover penalty term, and performance is guaranteed under different handover penalty settings.

Description

Unmanned aerial vehicle network multi-user access control method based on deep reinforcement learning
Technical Field
The invention belongs to the technical field of wireless communication, and relates to an unmanned aerial vehicle network multi-user access control method based on deep reinforcement learning.
Background
Conventional access control techniques use threshold comparison: a metric is chosen (e.g., received signal strength) together with an appropriate threshold. When the received signal strength that the user equipment (UE) receives from its source base station falls below the set threshold, the UE selects for access a base station that can provide a received signal strength above the threshold. However, in a drone network that uses drones as base stations, the mobility of the base stations makes the relative distance between base station and user change frequently, so the received signal strength at the user fluctuates drastically; the conventional access control technique then causes frequent handovers, which incur a large amount of extra signaling overhead. In addition, when several UEs hand over simultaneously, conventional access control can only guarantee the throughput of a single user, not the throughput of the entire system.
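For concreteness, the following is a minimal sketch of the threshold-comparison rule described above; the threshold value and the strongest-qualifying-station tie-break are illustrative assumptions, not values taken from the patent:

```python
def threshold_access(current_bs, rss_dbm, threshold_dbm=-90.0):
    """Conventional threshold-based access control (hedged sketch).

    current_bs: id of the currently serving base station.
    rss_dbm:    dict mapping base-station id -> received signal strength (dBm).
    Returns the base station to use in the next slot.
    """
    if rss_dbm[current_bs] >= threshold_dbm:
        return current_bs                 # source signal still above threshold: stay
    # source dropped below the threshold: look for a station above it
    candidates = {j: p for j, p in rss_dbm.items() if p >= threshold_dbm}
    if candidates:
        return max(candidates, key=candidates.get)  # strongest qualifying station
    return current_bs                     # nothing clears the threshold: stay put
```

Because a moving drone base station drives the measured signal strength back and forth across the threshold, this rule hands over again and again, which is exactly the frequent-handover problem described above.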
Disclosure of Invention
In order to solve the frequent-handover problem of conventional access control in a drone network and to guarantee the overall network throughput in the multi-user access case, the invention mainly focuses on the long-term throughput of the overall system and on the number of handovers. Deep reinforcement learning performs excellently in decision problems in complex dynamic environments; therefore, to overcome the difficulty of collecting global network information in the drone network environment, the invention exploits the inherent variation patterns of the environment to provide a deep reinforcement learning framework suited to multi-user access in a drone network, and realizes a deep-reinforcement-learning-based multi-user access control scheme for the drone network without knowledge of the global network information.
In the invention, a system model is established from the perspective of using drones as mobile base stations to serve ground users: the drones move along preset trajectories and provide downlink transmission service to the ground UEs. Each UE is regarded as an independent decision maker and selects a suitable drone base station to access in each time slot. The decision process is handed over entirely to the UEs, while the drone base stations are only responsible for receiving access requests and providing transmission service. There is no information exchange among the UEs during the decision process, i.e., each UE's decision depends only on the network information it obtains itself, which reduces the overall signaling overhead.
In order to solve the multi-user access decision problem, the invention provides a deep reinforcement learning framework with distributed decision-making and centralized training, i.e., a central node is responsible for training the neural network parameters of all UEs. In this framework, each UE is equipped with a neural network of identical structure and obtains its access strategy by feeding its local network information into the neural network; the central node collects experience information from every UE, trains the neural network parameters, and transmits the trained parameters to the users after each training stage is completed. After obtaining the trained neural network parameters from the central node, a UE updates its local neural network parameters. By separating the decision process from the training process, a UE only needs to use the already-trained neural network, which reduces the computational complexity at the UE.
In order to solve the problem that base-station position information is difficult to collect in a drone network, the invention avoids position information in the design of the user state and mainly uses quantities such as the user's received signal strength, which can be measured directly and locally. To avoid frequent handovers and to guarantee the throughput of the whole network in the multi-user case, the design of the deep reinforcement learning reward function considers not only the user's throughput but also the suppression of UE handovers and the influence of a single UE's access action on the other relevant UEs.
In order to better capture and learn the variation pattern of the received signal strength at the UE, the invention introduces a long short-term memory (LSTM) network into the neural network design. The neural network design is simple: the LSTM extracts features, which are then processed by a three-layer fully connected network to produce the corresponding access decision output.
Compared with conventional access control, the proposed scheme achieves higher system throughput and fewer handovers. Moreover, different trade-offs between throughput and the number of handovers can be realized by adjusting the handover penalty term, and performance is guaranteed under different handover penalty settings.
Drawings
FIG. 1 shows the system model of the drone network according to the present invention;
FIG. 2 illustrates the deep reinforcement learning framework model according to the present invention;
FIG. 3 shows the structural model of the neural network of the present invention;
FIG. 4 compares the throughput and the number of handovers of the access control scheme proposed by the present invention with those of a conventional access control scheme.
Detailed Description
The invention is described in detail below with reference to the drawings and simulation examples so that those skilled in the art can better understand the invention.
FIG. 1 shows the system model of the present invention. The wireless communication system consists of two parts: drone base stations and ground UEs. The drone base stations fly along fixed trajectories in the air, while the UEs are located on the ground. Since the drone base station flies in the air, the channel contains two components, line-of-sight (LOS) and non-line-of-sight (NLOS), and the proportion with which each component occurs is mainly determined by the elevation angle between the drone and the ground user. Both the LOS and the NLOS component include large-scale fading and small-scale fading; the large-scale fading is mainly determined by the distance between the UE and the base station, while the small-scale fading follows a Rician and a Rayleigh distribution, respectively. Specifically, the channel gain model between the jth drone base station and the ith ground UE may be expressed as:
$$g_{i,j}(t) = P_{i,j}^{\mathrm{LOS}}(t)\,g_{i,j}^{\mathrm{LOS}}(t) + P_{i,j}^{\mathrm{NLOS}}(t)\,g_{i,j}^{\mathrm{NLOS}}(t)$$

where $P_{i,j}^{\mathrm{LOS}}$ and $P_{i,j}^{\mathrm{NLOS}}$ respectively denote the proportions with which the LOS and NLOS components occur, and $g_{i,j}^{\mathrm{LOS}}$ and $g_{i,j}^{\mathrm{NLOS}}$ denote the corresponding channel gains,

$$g_{i,j}^{k}(t) = \mu_{k}\left(\frac{4\pi f\,l_{i,j}(t)}{v}\right)^{-\alpha_{k}}\bigl|h_{i,j}^{k}(t)\bigr|^{2},\qquad k\in\{\mathrm{LOS},\mathrm{NLOS}\},$$

where $f$ denotes the carrier frequency and $v$ the speed of light, $\mu_{\mathrm{LOS}}$ and $\mu_{\mathrm{NLOS}}$ are the attenuation factors of the LOS and NLOS components, $l_{i,j}$ is the distance between the drone base station and the UE, and $\alpha_{\mathrm{LOS}}$ and $\alpha_{\mathrm{NLOS}}$ are the path-loss exponents of LOS and NLOS; the small-scale fading factors $h_{i,j}^{\mathrm{LOS}}$ and $h_{i,j}^{\mathrm{NLOS}}$ follow the Rician and Rayleigh distributions, respectively.
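As a numerical illustration of the channel model above, the following Python sketch draws one sample of $g_{i,j}$. The logistic LOS-probability model, the Rician K-factor and all parameter values are illustrative assumptions, not values taken from the patent:

```python
import numpy as np

def channel_gain(l, elev_deg, f=2e9, v=3e8,
                 mu=(1.0, 0.2), alpha=(2.0, 3.0), k_factor=10.0, rng=None):
    """One draw of g = P_LoS * g_LoS + P_NLoS * g_NLoS for distance l (m)
    and elevation angle elev_deg (degrees)."""
    rng = rng or np.random.default_rng()
    # elevation-dependent LOS proportion (a common logistic model; assumed here)
    a, b = 9.61, 0.16
    p_los = 1.0 / (1.0 + a * np.exp(-b * (elev_deg - a)))

    def small_scale(rician):
        scatter = np.sqrt(0.5) * (rng.standard_normal() + 1j * rng.standard_normal())
        if rician:   # Rician fading for the LOS component
            return abs(np.sqrt(k_factor / (k_factor + 1)) +
                       np.sqrt(1 / (k_factor + 1)) * scatter) ** 2
        return abs(scatter) ** 2          # Rayleigh fading for the NLOS component

    g_los = mu[0] * (4 * np.pi * f * l / v) ** (-alpha[0]) * small_scale(True)
    g_nlos = mu[1] * (4 * np.pi * f * l / v) ** (-alpha[1]) * small_scale(False)
    return p_los * g_los + (1 - p_los) * g_nlos
```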
In the established system model every drone base station uses the same transmission power. Since the channel gain contains small-scale fading, the UE averages the received signal over multiple samples during access selection to remove it; the resulting average received signal strength can be expressed as:

$$\rho_{i,j}(t) = \frac{P_t}{N}\sum_{n=1}^{N} g_{i,j}^{(n)}(t),$$

where $P_t$ is the transmission power of the drone base station and $N$ is the number of signal samples being averaged.
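A short sketch of the sampling average; the sample count of 100 in the commented usage line is an arbitrary illustrative choice:

```python
import numpy as np

def average_rss(p_t, gain_samples):
    """rho_{i,j}(t) = (P_t / N) * sum_n g^{(n)}_{i,j}(t): averaging the N
    instantaneous channel-gain samples washes out the small-scale fading."""
    return p_t * np.mean(gain_samples)

# e.g., reusing the channel-gain sketch above:
# rho = average_rss(p_t=1.0, gain_samples=[channel_gain(300, 45) for _ in range(100)])
```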
Because all drone base stations transmit on the same spectrum resource, a ground UE accessing one drone for transmission receives interference from all the other drones. The SINR at the user can be expressed as:

$$\mathrm{SINR}_{i,j}(t) = \frac{P_t\,g_{i,j}(t)}{\sum_{k\in\mathcal{J},\,k\neq j} P_t\,g_{i,k}(t) + \sigma^{2}},$$

where $\mathcal{J}$ denotes the set of drone base stations in the network and $\sigma^{2}$ denotes the noise power.
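In code the SINR computation is a few lines over the per-drone received powers; `rx_powers[k]` below stands for $P_t\,g_{i,k}(t)$ (an assumed data layout):

```python
import numpy as np

def sinr(rx_powers, serving_j, noise_power):
    """SINR of a UE attached to drone serving_j when every other drone,
    transmitting on the same spectrum, is received as interference."""
    rx_powers = np.asarray(rx_powers, dtype=float)
    interference = rx_powers.sum() - rx_powers[serving_j]
    return rx_powers[serving_j] / (interference + noise_power)
```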
The user selects a suitable drone base station to access in each time slot. A base station accessed by several users within a single slot serves them in time-division multiple access (TDMA) form, i.e., the slot is divided evenly into as many equal sub-slots as there are accessed users. The reception rate of the UE can be expressed as:

$$\omega_{i}(t) = \frac{B}{N_{j}(t)}\log_{2}\bigl(1 + \mathrm{SINR}_{i,j}(t)\bigr),$$

where $B$ denotes the frequency bandwidth used for base-station transmission and $N_{j}(t)$ denotes the number of users accessing base station $j$ at that time.
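A direct transcription of the rate formula, as a sketch:

```python
import numpy as np

def ue_rate(bandwidth_hz, n_users_on_bs, sinr_value):
    """omega_i(t) = B / N_j(t) * log2(1 + SINR): the slot is split evenly
    among the N_j(t) UEs attached to base station j."""
    return bandwidth_hz / n_users_on_bs * np.log2(1.0 + sinr_value)
```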
FIG. 2 shows the proposed deep reinforcement learning framework. The framework consists of three parts: the drone base stations, the central node and the UEs. The drone base stations are responsible for the transmission service, the central node is responsible for training the neural network parameters of the UEs, and each UE makes a suitable base-station access selection in each decision stage. Every UE is equipped with the same neural network as the central node; the neural network parameters at the UE are obtained from the central node and can be regarded as a replica of those at the central node. Each UE is treated as an independent individual in the framework: no information exchange takes place between UEs, and every UE independently selects a drone base station to access and is responsible for transmitting its own network information to the central node.
For a single UE, the other users and the drone base stations can be regarded as its environment. The overall information interaction therefore consists of two parts: the interaction between the UE and the environment, and the transfer of experience information and network parameters between the UE and the central node. In each access selection stage, every UE selects a suitable drone base station to access according to its own state. Since we mainly focus on maximizing user throughput, and a user's reception rate is mainly determined by the received signal strength and the number of users accessing the base station, the number of user connections and the received signal strength serve as the main state elements; the specific state can be expressed as:
$$s_{i}(t) = \Bigl\{\{u_{i,j}(t-1)\}_{j},\ \{\rho_{i,j}(t-1)\}_{j},\ \{\rho_{i,j}(t)\}_{j},\ \{N_{j}(t-1)\}_{j},\ \omega_{i}(t-1)\Bigr\}$$

where $u_{i,j}$, a binary indicator that may also be called the access indicator variable, is "1" if base station $j$ is accessed and "0" if it is not selected for access. The state design thus includes the user's access indicator variables $u_{i,j}(t-1)$ at the previous time instant, the received signal strengths $\rho_{i,j}(t-1)$ and $\rho_{i,j}(t)$ at the previous and the current time instant, the number of access users $N_{j}(t-1)$ of each base station at the previous time instant, and the throughput $\omega_{i}(t-1)$ of the UE at the previous time instant.
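As an illustration, the listed state elements can be flattened into the input vector of the neural network; the concatenation order below is an assumption, since the text only enumerates the elements:

```python
import numpy as np

def build_state(u_prev, rss_prev, rss_now, n_users_prev, throughput_prev):
    """Flatten the state elements of UE i into one vector: access indicators
    u_{i,j}(t-1), averaged RSS at t-1 and t, per-base-station user counts
    N_j(t-1), and the UE's previous throughput w_i(t-1)."""
    return np.concatenate([np.ravel(u_prev), np.ravel(rss_prev),
                           np.ravel(rss_now), np.ravel(n_users_prev),
                           [throughput_prev]]).astype(np.float32)
```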
After making its access selection, the UE sends an access request to the selected drone base station, and after receiving the request the drone provides transmission service for the UE. Once all UE access decisions have been made, the environment information is updated: each drone base station counts its own number of access users and sends the new network information to every UE, which forms the UE's new state. All UEs then transmit their original state, the access selection made, the resulting throughput and the new state to the central node. The central node computes the reward function of each UE and thereby completes the experience information. The final reward function may be expressed as:
$$r_{i}(t) = \omega_{i}(t) + \eta\,\phi_{i}(t) - C\cdot\mathbb{1}\bigl[a_{i}(t)\neq a_{i}(t-1)\bigr],$$

where $\phi_{i}(t)$ indicates the impact of the UE's access selection on the performance of the other relevant users, $a_{i}(t)$ and $a_{i}(t-1)$ denote the access actions taken by the user at times $t$ and $t-1$ respectively, $C$ denotes the penalty for generating a handover, and $\eta$ is a control factor.
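A direct transcription of this reward into code; the values of η and C are placeholders, and φ_i(t) (`impact_on_others`) is supplied by the central node once it has collected every UE's throughput:

```python
def reward(throughput_i, impact_on_others, action_t, action_prev, eta=0.1, C=1.0):
    """r_i(t) = w_i(t) + eta * phi_i(t) - C * 1[a_i(t) != a_i(t-1)]."""
    handover_penalty = C if action_t != action_prev else 0.0
    return throughput_i + eta * impact_on_others - handover_penalty
```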
After collecting the experience information of all UEs, the central node stores all of it in a local memory in queue form, aggregating the experience information of all users. The central node then draws random samples from this memory as the training samples of the current round and trains the neural network parameters by stochastic gradient descent. After each round of training finishes, the central node sends the trained neural network parameters to every UE. After obtaining the new neural network parameters, a UE updates its local parameters and uses the updated neural network to make a handover decision according to its new state.
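A minimal sketch of the central node just described: a queue-shaped replay memory, random mini-batch sampling, one stochastic-gradient-descent step, and a parameter broadcast. The DQN-style temporal-difference target and all hyperparameter values are assumptions; the patent only specifies queued experience, random sampling and stochastic gradient descent. `net` is the shared Q-network (e.g., the LSTM network of FIG. 3):

```python
import random
from collections import deque
import torch

class CentralNode:
    def __init__(self, net, capacity=50_000, batch_size=64, gamma=0.95, lr=1e-3):
        self.net = net                           # same structure as every UE's network
        self.memory = deque(maxlen=capacity)     # experience stored in queue form
        self.batch_size, self.gamma = batch_size, gamma
        self.opt = torch.optim.SGD(net.parameters(), lr=lr)

    def store(self, state_seq, action, r, next_state_seq):
        """Experience reported by a UE: (s, a, r, s'), states as tensors."""
        self.memory.append((state_seq, action, r, next_state_seq))

    def train_step(self):
        if len(self.memory) < self.batch_size:
            return
        batch = random.sample(self.memory, self.batch_size)   # random sampling
        s  = torch.stack([b[0] for b in batch])
        a  = torch.tensor([b[1] for b in batch])
        r  = torch.tensor([b[2] for b in batch], dtype=torch.float32)
        s2 = torch.stack([b[3] for b in batch])
        q = self.net(s).gather(1, a.unsqueeze(1)).squeeze(1)  # Q of taken actions
        with torch.no_grad():                                 # bootstrapped TD target
            target = r + self.gamma * self.net(s2).max(dim=1).values
        loss = torch.nn.functional.mse_loss(q, target)
        self.opt.zero_grad(); loss.backward(); self.opt.step()

    def broadcast(self):
        """Parameters each UE copies into its local replica after training."""
        return self.net.state_dict()
```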
FIG. 3 shows the neural network structure employed in the present invention. The structure consists of two parts: an LSTM network and a fully connected network. The LSTM network is responsible for extracting temporal-continuity features from the input parameters, and the data of M time instants are input to the LSTM network together; the fully connected network processes the features extracted by the LSTM network to obtain the corresponding access strategy.
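A sketch of this structure in PyTorch. The hidden width of 64 and the use of the last LSTM time step are illustrative assumptions; the LSTM followed by a three-layer fully connected head follows the description of FIG. 3:

```python
import torch
from torch import nn

class AccessNet(nn.Module):
    """LSTM feature extractor followed by a three-layer fully connected head.
    Input: the UE's states over the last M time instants, shape (batch, M, state_dim).
    Output: one score per candidate drone base station (the access strategy)."""
    def __init__(self, state_dim, n_drones, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(state_dim, hidden, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_drones))

    def forward(self, x):
        feats, _ = self.lstm(x)          # temporal-continuity features over M steps
        return self.head(feats[:, -1])   # decide from the final time step's features

# a UE picks its access action greedily from its local replica, e.g.:
# action = AccessNet(state_dim=10, n_drones=3)(state_seq).argmax(dim=1)
```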
FIG. 4 shows the system throughput and the number of handovers of the proposed access control technique under different handover penalty coefficients; the results are measured over a test period of 1000 time slots. It can be seen that, compared with conventional access control methods (a received-signal-strength-based method and a learning-algorithm-based method), the proposed access control method achieves higher system throughput with a smaller number of handovers. Under the different handover penalty settings the proposed technique achieves the best performance, and different trade-offs between the number of handovers and the system throughput can be realized by adjusting the handover penalty term.

Claims (3)

1. An unmanned aerial vehicle network multi-user access control method based on deep reinforcement learning, for a system in which unmanned aerial vehicles serve as mobile base stations providing service to ground user equipment (UE), characterized in that the control method comprises the following steps:
constructing a deep reinforcement learning framework with distributed decision-making and centralized training, namely configuring a neural network of identical structure for each UE, each UE independently obtaining its strategy for accessing an unmanned aerial vehicle base station according to its own neural network; meanwhile, providing a central node with the same neural network for collecting experience information from each UE and training the neural network parameters, the central node transmitting the trained parameters to each UE after each training stage is completed;
the specific method for the central node to collect experience information from each UE is as follows:
the UE needs to select a proper action according to its own state, and obtains a corresponding reward after execution, and the throughput of the UE is mainly related to the number of access users of the base station and the strength of the received signal, so the states of i UEs are expressed as:
$$s_{i}(t) = \Bigl\{\{u_{i,j}(t-1)\}_{j},\ \{\rho_{i,j}(t-1)\}_{j},\ \{\rho_{i,j}(t)\}_{j},\ \{N_{j}(t-1)\}_{j},\ \omega_{i}(t-1)\Bigr\}$$

wherein $u_{i,j}$ is the defined access indication variable, a binary indicator, namely "1" indicates that the jth unmanned aerial vehicle base station is accessed and "0" indicates that it is not selected for access; the state comprises the user's access indicator variables $u_{i,j}(t-1)$ at the previous time instant, the received signal strengths $\rho_{i,j}(t-1)$ and $\rho_{i,j}(t)$ at the previous and the current time instant, the number of access users $N_{j}(t-1)$ of each base station at the previous time instant, and the throughput $\omega_{i}(t-1)$ of the UE at the previous time instant;
after making its access selection, the UE sends an access request to the selected unmanned aerial vehicle base station, and after receiving the request the unmanned aerial vehicle provides transmission service for the UE;
after all UE access decisions have been made, the environment information is updated: each unmanned aerial vehicle base station counts its own number of access users and sends the new network information to every UE, forming the UE's new state; all UEs transmit their original state, the access selection made, the resulting throughput and the new state to the central node, and the central node calculates the reward function of each UE and thereby completes the experience information:
$$r_{i}(t) = \omega_{i}(t) + \eta\,\phi_{i}(t) - C\cdot\mathbb{1}\bigl[a_{i}(t)\neq a_{i}(t-1)\bigr],$$

wherein $\omega_{i}(t)$ represents the throughput of the UE at the current time instant, $\phi_{i}(t)$ represents the throughput change the UE's access selection causes for the other relevant users, defined as the impact on the other users' performance, $a_{i}(t)$ and $a_{i}(t-1)$ represent the access actions taken by the user at time $t$ and time $t-1$ respectively, $C$ represents the penalty for generating a handover, and $\eta$ is a control coefficient.
2. The unmanned aerial vehicle network multi-user access control method based on deep reinforcement learning of claim 1, wherein the specific method for the central node to train neural network parameters is as follows:
after the central node collects experience information of all the UEs, all the information is stored in a local memory in a queue form, the experience information of all the UEs is collected, random sampling is carried out by using a random gradient descent method, and an obtained sample is used as a training sample of the training to train the neural network parameters.
3. The unmanned aerial vehicle network multi-user access control method based on deep reinforcement learning according to claim 2, wherein the neural network consists of a long short-term memory network and a fully connected network: the long short-term memory network is responsible for extracting temporal-continuity features from the input parameters, and the data of M time instants are input to the long short-term memory network together; the fully connected network processes the features extracted by the long short-term memory network to obtain the corresponding access strategy.
CN201910074944.6A 2019-01-25 2019-01-25 Unmanned aerial vehicle network multi-user access control method based on deep reinforcement learning Active CN109743210B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910074944.6A CN109743210B (en) 2019-01-25 2019-01-25 Unmanned aerial vehicle network multi-user access control method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910074944.6A CN109743210B (en) 2019-01-25 2019-01-25 Unmanned aerial vehicle network multi-user access control method based on deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN109743210A (en) 2019-05-10
CN109743210B (en) 2020-04-17

Family

Family ID: 66366151

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910074944.6A Active CN109743210B (en) 2019-01-25 2019-01-25 Unmanned aerial vehicle network multi-user access control method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN109743210B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP4181559A4 (en) * 2020-07-13 2024-01-10 Huawei Technologies Co., Ltd. Communication method and communication device

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110351252A (en) * 2019-06-19 2019-10-18 南京航空航天大学 It is a kind of can synchronism switching unmanned plane ad hoc network adaptive media connection control method
CN110458283A (en) * 2019-08-13 2019-11-15 南京理工大学 Maximization overall situation handling capacity method under static environment based on deeply study
CN110661566B (en) * 2019-09-29 2021-11-19 南昌航空大学 Unmanned aerial vehicle cluster networking method and system adopting depth map embedding
CN111083767B (en) * 2019-12-23 2021-07-27 哈尔滨工业大学 Heterogeneous network selection method based on deep reinforcement learning
CN111884740B (en) * 2020-06-08 2022-04-29 江苏方天电力技术有限公司 Unmanned aerial vehicle channel optimal allocation method and system based on frequency spectrum cognition
CN112947541B (en) * 2021-01-15 2022-07-26 南京航空航天大学 Unmanned aerial vehicle intention track prediction method based on deep reinforcement learning
CN113342030B (en) * 2021-04-27 2022-07-08 湖南科技大学 Multi-unmanned aerial vehicle cooperative self-organizing control method and system based on reinforcement learning
CN115454646B (en) * 2022-09-29 2023-08-25 电子科技大学 Multi-agent reinforcement learning acceleration method for clustered unmanned plane decision

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107680151A (en) * 2017-09-27 2018-02-09 千寻位置网络有限公司 Strengthen the method and its application of the indicative animation fulfillment capability in Web3D

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104950906A (en) * 2015-06-15 2015-09-30 中国人民解放军国防科学技术大学 Unmanned aerial vehicle remote measuring and control system and method based on mobile communication network
CN106094606A (en) * 2016-05-19 2016-11-09 南通航运职业技术学院 A kind of unmanned surface vehicle navigation and control remote-controlled operation platform
US9826415B1 (en) * 2016-12-01 2017-11-21 T-Mobile Usa, Inc. Tactical rescue wireless base station
US10332320B2 (en) * 2017-04-17 2019-06-25 Intel Corporation Autonomous vehicle advanced sensing and response
CN107205225B (en) * 2017-08-03 2019-10-11 北京邮电大学 The switching method and apparatus of unmanned aerial vehicle onboard base station based on user trajectory prediction
CN108684047B (en) * 2018-07-11 2020-09-01 北京邮电大学 Unmanned aerial vehicle bearing small base station communication system and method
CN109195135B (en) * 2018-08-06 2021-03-26 同济大学 Base station selection method based on deep reinforcement learning in LTE-V



Also Published As

Publication number Publication date
CN109743210A (en) 2019-05-10

Similar Documents

Publication Publication Date Title
CN109743210B (en) Unmanned aerial vehicle network multi-user access control method based on deep reinforcement learning
CN110809306B (en) Terminal access selection method based on deep reinforcement learning
Cao et al. Deep reinforcement learning for multi-user access control in non-terrestrial networks
CN113873434B (en) Communication network hotspot area capacity enhancement oriented multi-aerial base station deployment method
CN111526592B (en) Non-cooperative multi-agent power control method used in wireless interference channel
CN110380776B (en) Internet of things system data collection method based on unmanned aerial vehicle
CN114900225B (en) Civil aviation Internet service management and access resource allocation method based on low-orbit giant star base
CN113115344B (en) Unmanned aerial vehicle base station communication resource allocation strategy prediction method based on noise optimization
CN113055078B (en) Effective information age determination method and unmanned aerial vehicle flight trajectory optimization method
CN115499921A (en) Three-dimensional trajectory design and resource scheduling optimization method for complex unmanned aerial vehicle network
Cao et al. Deep reinforcement learning for multi-user access control in UAV networks
CN114980169A (en) Unmanned aerial vehicle auxiliary ground communication method based on combined optimization of track and phase
Chen et al. An actor-critic-based UAV-BSs deployment method for dynamic environments
Najla et al. Machine learning for power control in D2D communication based on cellular channel gains
CN106060917A (en) Antenna and power joint allocation algorithm based on group match
CN114268348A (en) Honeycomb-free large-scale MIMO power distribution method based on deep reinforcement learning
CN101217345A (en) A detecting method on vertical layered time space code message system
Cao et al. Multi-tier collaborative deep reinforcement learning for non-terrestrial network empowered vehicular connections
He et al. Guest editorial 5G wireless communications with high mobility
CN116866974A (en) Federal learning client selection method based on deep reinforcement learning
CN116634450A (en) Dynamic air-ground heterogeneous network user association enhancement method based on reinforcement learning
Si et al. UAV-assisted Semantic Communication with Hybrid Action Reinforcement Learning
Zheng et al. NSATC: An interference aware framework for multi-cell NOMA TUAV airborne provisioning
CN114980205A (en) QoE (quality of experience) maximization method and device for multi-antenna unmanned aerial vehicle video transmission system
Tarekegn et al. Channel Quality Estimation in 3D Drone Base Station for Future Wireless Network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant