CN115314904A - Communication coverage method and related equipment based on multi-agent maximum entropy reinforcement learning

Info

Publication number
CN115314904A
Authority
CN
China
Prior art keywords
neural network
masac
clustering
action
unmanned aerial
Prior art date
Legal status
Granted
Application number
CN202210674727.2A
Other languages
Chinese (zh)
Other versions
CN115314904B (en)
Inventor
许文俊
吴思雷
林兰
李国军
王凤玉
张天魁
Current Assignee
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications filed Critical Beijing University of Posts and Telecommunications
Priority to CN202210674727.2A priority Critical patent/CN115314904B/en
Publication of CN115314904A publication Critical patent/CN115314904A/en
Application granted granted Critical
Publication of CN115314904B publication Critical patent/CN115314904B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W16/00Network planning, e.g. coverage or traffic planning tools; Network deployment, e.g. resource partitioning or cells structures
    • H04W16/18Network planning tools
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • G06N20/20Ensemble learning
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04BTRANSMISSION
    • H04B7/00Radio transmission systems, i.e. using radiation field
    • H04B7/14Relay systems
    • H04B7/15Active relay systems
    • H04B7/185Space-based or airborne stations; Stations for satellite systems
    • H04B7/18502Airborne stations
    • H04B7/18506Communications with or from aircraft, i.e. aeronautical mobile service
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W4/00Services specially adapted for wireless communication networks; Facilities therefor
    • H04W4/90Services for handling of emergency or hazardous situations, e.g. earthquake and tsunami warning systems [ETWS]
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00Reducing energy consumption in communication networks
    • Y02D30/70Reducing energy consumption in communication networks in wireless communication networks

Abstract

The application provides a post-disaster communication coverage method based on multi-agent maximum entropy reinforcement learning and related equipment. The method restores ground communication service for post-disaster users through a multi-unmanned-aerial-vehicle base station hybrid networking mode and provides a distributed clustering-trajectory layered air coverage optimization structure: the bottom layer realizes large-scale user clustering with high load efficiency and high balance through a distributed k-sum algorithm, and the upper layer combines the clustering results to optimize the flight trajectories of the multiple unmanned aerial vehicle base stations through a distributed-training, distributed-execution MASAC neural network, thereby reducing the communication interruption probability of the network and realizing air coverage optimization for large-scale post-disaster users. With the assistance of ensemble learning, the MASAC algorithm overcomes the non-stationary multi-agent training environment and the poor convergence stability caused by deterministic policy gradients, finally achieving the beneficial effect of reducing the communication interruption probability of the emergency communication network.

Description

Communication coverage method and related equipment based on multi-agent maximum entropy reinforcement learning
Technical Field
The application relates to the technical field of unmanned aerial vehicle emergency communication, in particular to a communication coverage method based on multi-agent maximum entropy reinforcement learning and related equipment.
Background
After a serious natural disaster, damage to ground base stations interrupts communication in the disaster area, blocks important rescue information from large-scale ground users, and seriously endangers the life and property safety of post-disaster users. Unmanned aerial vehicles can be deployed rapidly and regulated flexibly; equipped with emergency base stations, they can provide efficient air-to-ground communication links for ground users, and the communication coverage performance can be optimized by regulating the flight trajectories of all unmanned aerial vehicle base stations in real time. However, the dynamic and unknown communication environment and the large number of users make air coverage optimization for large-scale disaster-stricken users extremely challenging. Deep reinforcement learning can use a large amount of flight data for self-learning, fit the unknown environment, and cope with a certain degree of dynamics in the communication environment. However, large-scale disaster-stricken users make the network environment strongly dynamic, and the related deep reinforcement learning methods still suffer from poor stability, slow convergence, and exploding computational dimensionality.
Disclosure of Invention
In view of the above, the present application aims to provide a communication coverage method based on multi-agent maximum entropy reinforcement learning and a related device to solve the above problems.
Based on the above purpose, a first aspect of the present application provides a post-disaster communication method based on multi-agent maximum entropy, where a plurality of unmanned aerial vehicle base stations establishing communication connection in a hybrid networking manner form a communication network capable of covering a preset area, the communication network provides communication services for all users located in the preset area, and for any one of the unmanned aerial vehicle base stations in the communication network, the communication coverage method based on multi-agent maximum entropy reinforcement learning includes:
acquiring local observation information at the current moment;
based on the local observation information, clustering the users located in the preset area at the current moment by using a distributed clustering k-sum algorithm to obtain a clustering result;
characterizing the local observation information and the clustering result into a current state;
selecting a multi-agent maximum entropy reinforcement learning MASAC neural network from the trained neural network set as a target MASAC neural network;
inputting the current state into the target MASAC neural network to obtain a regulation action;
and controlling the flight track of the unmanned aerial vehicle base station based on the regulation and control action.
A second aspect of the present application provides a communication overlay apparatus based on multi-agent maximum entropy reinforcement learning, comprising:
an information acquisition module configured to: acquiring local observation information at the current moment;
a user clustering module configured to: based on the local observation information, clustering the users located in the preset area at the current moment by using a distributed clustering k-sum algorithm to obtain a clustering result;
a feature transformation module configured to: characterizing the local observation information and the clustering result into a current state;
a model selection module configured to: selecting a target multi-agent maximum entropy reinforcement learning MASAC neural network from the trained neural network set;
an action acquisition module configured to: inputting the current state into the target MASAC neural network to obtain a regulation action;
an action execution module configured to: and controlling the flight track of the unmanned aerial vehicle base station based on the regulation and control action.
A third aspect of the present application provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method as provided by the first aspect of the present application when executing the program.
A fourth aspect of the present application provides a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method provided by the first aspect of the present application.
From the above, it can be seen that the present application provides a communication coverage method based on multi-agent maximum entropy reinforcement learning and a related device. Firstly, local observation information at the current moment is obtained; then, based on the local observation information, the users located in a preset area at the current moment are clustered by using a distributed clustering k-sum algorithm to obtain a clustering result; the local observation information and the clustering result are characterized into a current state; a multi-agent maximum entropy reinforcement learning MASAC neural network is selected from the trained neural network set as a target MASAC neural network; the current state is input into the target MASAC neural network to obtain a regulation action; and finally, the flight trajectory of the unmanned aerial vehicle base station is controlled based on the regulation action. The method restores ground communication service for post-disaster users through a multi-unmanned-aerial-vehicle base station hybrid networking mode and provides a distributed clustering-trajectory layered air coverage optimization structure: the bottom layer realizes large-scale user clustering with high load efficiency and high balance through a distributed k-sum algorithm, and the upper layer combines the clustering results to optimize the flight trajectories of the multiple unmanned aerial vehicle base stations through a distributed-training, distributed-execution MASAC (Multi-Agent Soft Actor Critic) algorithm. With the assistance of ensemble learning, the MASAC algorithm overcomes the non-stationary multi-agent training environment and the poor convergence stability caused by deterministic policy gradients, finally achieving the beneficial effect of reducing the communication interruption probability of the emergency communication network. Ground user clustering and the flight trajectories of the multiple unmanned aerial vehicle base stations are regulated through a distributed-training, distributed-execution architecture, the communication interruption probability of the network is reduced, and air coverage optimization for large-scale post-disaster users is realized.
Drawings
In order to more clearly illustrate the technical solutions in the present application or the related art, the drawings required for describing the embodiments or the related art are briefly introduced below. It is apparent that the drawings described below only illustrate embodiments of the present application, and that those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic view of an application scenario of a large-scale emergency communication network powered by multiple unmanned aerial vehicle base stations according to an embodiment of the present application;
FIG. 2 is a flowchart of a communication coverage method based on multi-agent maximum entropy reinforcement learning according to an embodiment of the present application;
FIG. 3 is a flowchart of user clustering according to an embodiment of the present application;
FIG. 4 is a flowchart of an iterative optimization method of a k-sums algorithm according to an embodiment of the present application;
FIG. 5 is a flow chart of sample playback in an embodiment of the present application;
FIG. 6 is a flow chart of sample construction according to an embodiment of the present application;
FIG. 7 is a flowchart of training a MASAC neural network according to an embodiment of the present application;
FIG. 8 is a diagram of a multi-agent reinforcement learning MASAC agent structure according to an embodiment of the present application;
FIG. 9 is a flowchart of an embodiment of the present application for obtaining a new target MASAC neural network;
FIG. 10 is a schematic diagram of an architecture for implementing a stable convergence technique based on ensemble learning according to an embodiment of the present application;
FIG. 11 is a block diagram of a communication overlay apparatus based on multi-agent maximum entropy reinforcement learning according to an embodiment of the present application;
fig. 12 is a block diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is further described in detail below with reference to the accompanying drawings in combination with specific embodiments.
It should be noted that technical terms or scientific terms used in the embodiments of the present application should have a general meaning as understood by those having ordinary skill in the art to which the present application belongs, unless otherwise defined. The use of "first," "second," and similar terms in the embodiments of the present application do not denote any order, quantity, or importance, but rather the terms are used to distinguish one element from another. The word "comprising" or "comprises", and the like, means that the element or item listed before the word covers the element or item listed after the word and its equivalents, but does not exclude other elements or items. The terms "connected" or "coupled" and the like are not restricted to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "upper", "lower", "left", "right", and the like are used only to indicate relative positional relationships, and when the absolute position of the object being described is changed, the relative positional relationships may also be changed accordingly.
It should be noted that, this application embodiment mainly relates to two kinds of key technologies of unmanned aerial vehicle emergency communication technique and deep reinforcement learning.
The unmanned aerial vehicle emergency communication technology is regarded as an indispensable core technology in an emergency communication network, and a movable base station is configured on an unmanned aerial vehicle, and the flight trajectory of the unmanned aerial vehicle base station is reasonably regulated and controlled so as to meet the communication requirements of ground disaster-stricken users. Due to the unique air-to-ground channel model provided by the unmanned aerial vehicle base station and the high-dynamic three-dimensional flight capability, the unmanned aerial vehicle emergency communication network is more complex compared with the traditional communication network. The flight trajectory of each unmanned aerial vehicle base station determines the communication rate of the unmanned aerial vehicle and the ground user, and the interference to other unmanned aerial vehicle base stations directly influences the coverage performance of the whole network. The flight tracks of the unmanned aerial vehicle base stations are jointly regulated, so that the communication interruption probability of the emergency unmanned aerial vehicle communication network can be effectively reduced.
Deep reinforcement learning is a machine learning technique that can be used to solve decision problems in unknown and dynamic environments. It is characterized by "exploration-exploitation": the state-action value function of the environment is fitted through "exploration", and the action that maximizes the state-action value function is selected as the decision output through "exploitation". In scenarios where multiple agents exist simultaneously, the policy improvement of each agent makes the training environment of the other agents non-stationary, and there is also the possibility of a vicious game arising among the agents.
In the related art, the optimization algorithm needs to acquire the network environment information at all times from the start of the task to its end, and optimizes the flight trajectories of the multiple unmanned aerial vehicles by combining the global information at all times. For example, a mean-field game objective based on interference interaction is established, and after global deterministic information is obtained, the optimal trajectory planning of the multiple unmanned aerial vehicles is obtained by iterative computation with an optimization method. Alternatively, a distributed prediction module and a fuzzy target module are used to cope with the real-time actions of the other unmanned aerial vehicles, so that a vicious game among the unmanned aerial vehicles is avoided and the trajectory optimization results of the multiple unmanned aerial vehicles are mutually self-consistent.
However, when the algorithms in the above related art perform network coverage optimization, the network state is difficult to obtain, the parameter dimensions are difficult to generalize, and dynamic changes are difficult to track. Firstly, because the emergency communication service types differ in unknown ways and the user positions and activation states are dynamic, the unmanned aerial vehicle base station can hardly acquire or accurately predict all network state information over a future period of time; the requirement of the non-convex optimization method on the network state cannot be met, and the result obtained by optimizing only a static network snapshot deviates from the actual optimum. Secondly, the related optimization algorithms are limited by the fixed optimization duration, the number of unmanned aerial vehicles, and the number of users; as the duration and these numbers grow, the computation required to solve the non-convex optimization problem increases explosively, making it difficult to meet the requirement of rapidly regulating the flight trajectories of the unmanned aerial vehicles. Thirdly, after the network state changes dynamically, the unmanned aerial vehicle base station needs to perform another round of complex non-convex optimization; the utilization efficiency of historical data is low, and the computation is redundant when parts of the communication network state are similar.
On the other hand, unmanned aerial vehicle communication network coverage optimization methods based on reinforcement learning can effectively cope with the unknown and dynamic characteristics of the communication network environment. For example, in the related art, a multi-unmanned-aerial-vehicle network hovering position optimization method based on the multi-agent deep reinforcement learning MADDPG (Multi-Agent Deep Deterministic Policy Gradient) algorithm improves network throughput, ensures fairness of the unmanned aerial vehicles' service to ground users, reduces energy consumption, and enables the unmanned aerial vehicle cluster to adapt to a dynamic environment. Alternatively, an unmanned aerial vehicle cluster high-efficiency communication method based on the multi-agent deep reinforcement learning MADDPG algorithm uses a centralized-training, distributed-execution architecture to solve the problem of centralized information interaction in the unmanned aerial vehicle cluster at low communication overhead, and gives the unmanned aerial vehicles autonomous decision-making power to optimize the cluster communication performance in a distributed manner.
However, the algorithms in the related art regulate the flight trajectories of the multiple unmanned aerial vehicle base stations by fitting a global state-action value function, and suffer from weak scalability, high communication overhead, and unstable convergence. Firstly, the input dimension of the fitted global state-action value function grows with the number of unmanned aerial vehicle base stations, so the neural network becomes huge and difficult to scale. Secondly, all environment state information of the communication network must be gathered during centralized training, which requires enormous communication overhead in a large-scale emergency communication network scenario. Thirdly, the MADDPG algorithm based on deterministic policy output is strongly affected by its hyper-parameters and its convergence performance fluctuates widely, so convergence is unstable in a dynamic environment.
In the embodiment of the application, a communication coverage method based on multi-agent maximum entropy reinforcement learning is provided. A multi-unmanned-aerial-vehicle base station hybrid networking mode is used to restore ground communication service for post-disaster users, and a distributed clustering-trajectory layered air coverage optimization structure is provided. At the bottom layer, based on the obtained local observation information, the dynamic users in the preset area at the current moment are clustered by a distributed clustering k-sum algorithm to obtain a clustering result, realizing large-scale user clustering with high load efficiency and high balance. At the upper layer, the flight trajectories of the multiple unmanned aerial vehicle base stations are optimized by a distributed-training, distributed-execution MASAC neural network algorithm combined with the clustering result. With the assistance of ensemble learning, the MASAC neural network algorithm overcomes the non-stationary multi-agent training environment and the poor convergence stability caused by deterministic policy gradients, finally achieving the beneficial effect of reducing the communication interruption probability of the emergency communication network. Ground user clustering and the flight trajectories of the multiple unmanned aerial vehicle base stations are regulated through a distributed-training, distributed-execution architecture, the communication interruption probability of the network is reduced, and air coverage optimization for large-scale post-disaster users is realized.
Referring to fig. 1, a schematic view of an application scenario of the multi-agent maximum entropy reinforcement learning-based communication coverage method provided by an embodiment of the present application is shown. The application scenario includes a drone base station 101 and a user cluster 102. The drone base station 101 and the user cluster 102, as well as the users within the user cluster 102, can be connected through a wired or wireless communication network, and an agent is deployed on the drone base station 101. A plurality of drone base stations 101 are deployed in the emergency drone communication network; adjacent drone base stations 101 can exchange information with each other at some communication overhead, and the plurality of drone base stations 101 provide downlink service for ground users in a star-and-cluster hybrid networking mode to meet the communication coverage requirement.
Specifically, the drone base station 101 obtains local observation information of a current time in a preset area (e.g., a disaster area) through an observation component (e.g., a camera, a thermal imager, a sensor, etc.). The unmanned aerial vehicle base station 101 is provided with a server which is communicated with the intelligent agent, the server clusters users on the ground based on the local observation information to obtain a plurality of user clusters 102, one cluster center user is selected from each user cluster 102 to be used for forwarding information of the unmanned aerial vehicle base station 101 to other users in the user cluster 102, and the user cluster 102 and the center user are used as clustering results. The server characterizes local observation information and clustering results into a current state, selects a multi-agent maximum entropy reinforcement learning MASAC neural network from a neural network set trained by the agents as a target MASAC neural network, and inputs the current state into the target MASAC neural network so as to control the flight trajectory of the unmanned aerial vehicle base station 101. The unmanned aerial vehicle base station 101 can serve a plurality of user clusters 102 on the ground at the same time and reduce mutual interference.
The method for constructing the model according to the exemplary embodiment of the present application is described below with reference to the application scenario of fig. 1. It should be noted that the above application scenario is only presented to facilitate understanding of the spirit and principles of the present application, and the embodiments of the present application are not limited in this respect. Rather, the embodiments of the present application may be applied to any applicable scenario.
In some embodiments, as shown in fig. 1, a plurality of drone base stations that establish communication connections in a hybrid networking manner form a communication network capable of covering a preset area, and the communication network provides communication service for all users located in the preset area. For any one drone base station in the communication network, as shown in fig. 2 (where steps 100 and 200 belong to the bottom-layer user clustering process and steps 300 to 600 belong to the upper-layer trajectory optimization process), the communication coverage method based on multi-agent maximum entropy reinforcement learning includes the following steps:
step 100: and acquiring local observation information at the current moment.
In this step, because the disaster area is wide, it is difficult for a single drone base station to obtain global information (the local observation information obtained by all the drone base stations together forms the global information), so a distributed execution mode is selected: each drone base station performs user clustering and intra-cluster center user selection based on its own local information. Therefore, the local observation information at the current moment needs to be obtained first, and the drone base stations can be communicatively connected to share their respective local observation information at the current moment.
Step 200: based on the local observation information, clustering the users in the preset area at the current moment by using a distributed clustering k-sum algorithm to obtain a clustering result.
In the step, in the process of clustering the bottom-layer users, the unmanned aerial vehicle base station screens out the users needing service based on local observation information in a distributed mode, divides the users to be served into a plurality of user clusters, respectively selects a cluster center user for forwarding communication information, and takes the user cluster and the center user as a clustering result.
Step 300: and characterizing the local observation information and the clustering result into a current state.
In this step, facing an emergency communication network environment with unknown dynamics, the reinforcement learning problem is modeled as a Markov Decision Process (MDP), and an observation is obtained from the communication network environment as the current state s_t. Each drone base station extracts its observable information as the input state, which can be characterized as follows (a minimal sketch of assembling these components into a state vector is given after the list):
1) Coordinates of the unmanned aerial vehicle base station itself;
2) Two-dimensional relative position and activation state with the central user of the self-served user cluster;
3) A three-dimensional relative position to an adjacent drone;
4) Two-dimensional relative position and activation status with a central user of a cluster of users served by individual neighboring drones.
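For illustration only, the following minimal Python sketch shows one way the four components above could be flattened into a single state vector; the helper name build_state, the field layout and the example numbers are assumptions made for this sketch and are not specified by the present application.

    import numpy as np

    def build_state(own_xyz, center_users, neighbor_uavs, neighbor_centers):
        """Flatten the local observation into one state vector (illustrative layout).

        own_xyz          : (x, y, z) coordinates of the drone base station itself
        center_users     : list of (dx, dy, active) for the centers of self-served clusters
        neighbor_uavs    : list of (dx, dy, dz) relative positions of adjacent drones
        neighbor_centers : list of (dx, dy, active) for centers served by adjacent drones
        """
        parts = [np.asarray(own_xyz, dtype=np.float32)]
        for group in (center_users, neighbor_uavs, neighbor_centers):
            parts.append(np.asarray(group, dtype=np.float32).ravel())
        return np.concatenate(parts)

    # Example: two self-served cluster centers, one adjacent drone, one neighbor-served center
    s_t = build_state(
        own_xyz=[120.0, 80.0, 100.0],
        center_users=[(15.0, -4.0, 1.0), (-22.0, 9.0, 0.0)],
        neighbor_uavs=[(300.0, -50.0, 10.0)],
        neighbor_centers=[(310.0, -60.0, 1.0)],
    )
    print(s_t.shape)   # (15,)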
Step 400: and selecting a multi-agent maximum entropy reinforcement learning MASAC neural network from the trained neural network set as a target MASAC neural network.
In this step, the agent carried by each drone base station trains a set of neural networks simultaneously, forming the neural network set.
Step 500: and inputting the current state into a target MASAC neural network to obtain a regulation action.
In the step, the unmanned aerial vehicle base station can move freely in a three-dimensional space, and the regulation and control actions of the unmanned aerial vehicle base station comprise three directions of an x axis, a y axis and a z axis.
Step 600: and controlling the flight track of the unmanned aerial vehicle base station based on the regulation and control action.
The drone base station can move freely in three-dimensional space, and the action it executes based on the output regulation action can be characterized as the moving speeds along the x-axis, y-axis and z-axis. The upper layer combines the clustering result and optimizes the flight trajectories of the multiple drone base stations with a distributed-training, distributed-execution MASAC algorithm. The distributed training process avoids the weak scalability and high communication overhead of the centralized-training, distributed-execution multi-agent deep reinforcement learning MADDPG algorithm. For the problem of unstable convergence, with the assistance of ensemble learning, the MASAC algorithm overcomes the non-stationary multi-agent training environment and the poor convergence stability caused by deterministic policy gradients, finally achieving the effect of reducing the communication interruption probability of the emergency communication network.
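To tie steps 100 to 600 together, the sketch below outlines one distributed execution step of a single drone base station. The methods observe, cluster_users, build_state and apply_velocity, and the act method of the selected MASAC network, are hypothetical placeholders standing in for the operations described above, not interfaces defined by the present application.

    import random

    def execution_step(uav, ensemble, env):
        """One distributed execution step for a single drone base station (sketch)."""
        obs = uav.observe(env)                            # step 100: local observation
        clusters, centers = uav.cluster_users(obs)        # step 200: distributed k-sum clustering
        state = uav.build_state(obs, clusters, centers)   # step 300: characterize the current state
        policy = random.choice(ensemble)                  # step 400: pick a target MASAC network
        velocity_xyz = policy.act(state)                  # step 500: regulation action (vx, vy, vz)
        uav.apply_velocity(velocity_xyz)                  # step 600: control the flight trajectory
        return state, velocity_xyz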
In some embodiments, for example, suppose the preset area (disaster area) contains N users in total and M drone base stations are deployed, each serving K ground user clusters. The users and the drone base stations are represented by the sets N and M respectively, the set of all cluster-center users is denoted K, the set of users served by drone base station m is denoted N_m, the set of cluster-center users it serves is denoted K_m, and the set of the other users in the cluster of cluster-center user k is denoted N_k. After a large-scale disaster, a user may change position in real time and be activated randomly over time, so the dynamics are strong. The activation state of user i at the current time t ∈ [0, T] is therefore drawn according to a Beta distribution:

[Beta-distribution expression for f_i(t) over t ∈ [0, T], parameterized by k_1 and k_2]

where k_1 and k_2 are the parameters of the Beta distribution, t denotes the current time, and T denotes the total task duration. f_i(t) represents the probability that the activation coefficient b equals 1 at the current moment. If the user is in the activated state, the activation coefficient b = 1, there is a transmission task at the current moment, a communication link needs to be established with the nearest drone base station, and a spectrum resource block a is allocated to the link. Conversely, if the user is not in the activated state, the activation coefficient b = 0, and no communication link needs to be established.
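Because the activation-probability expression is reproduced only as an image above, the following sketch assumes a Beta(k_1, k_2)-shaped activation probability over the normalized time t/T, rescaled so that its peak equals 1; the parameter values are illustrative.

    import numpy as np

    def activation_probability(t, T, k1=2.0, k2=5.0):
        """Assumed form of f_i(t): a Beta(k1, k2)-shaped curve over normalized time t/T,
        scaled so that its peak value is 1 (k1, k2 > 1 assumed)."""
        x = t / T
        shape = x ** (k1 - 1) * (1 - x) ** (k2 - 1)
        mode = (k1 - 1) / (k1 + k2 - 2)
        peak = mode ** (k1 - 1) * (1 - mode) ** (k2 - 1)
        return shape / peak

    def sample_activations(num_users, t, T, seed=0):
        """Draw activation coefficients b_i for num_users users at time t."""
        rng = np.random.default_rng(seed)
        p = activation_probability(t, T)
        return (rng.random(num_users) < p).astype(int)    # b_i = 1 means user i is active

    print(sample_activations(num_users=10, t=30.0, T=100.0))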
As shown in fig. 3, step 200: based on local observation information, clustering users located in a preset area at the current moment by using a distributed clustering k-sum algorithm to obtain a clustering result, which specifically comprises the following steps:
step 210: and converting the local observation information into a clustering kernel matrix.
In this step, the clustering kernel matrix G_p of the distributed k-sum algorithm is expressed with the dissimilarity measures of the users observable in the local observation information, where O_m is the number of users observable by drone base station m and the adjacency coefficient L reflects how much observation information the drone base station holds. The dissimilarity measure between two users is characterized by the product of the number of resource blocks that user i_1 needs to load to transmit to the other user at the current moment and the two users' activation states:

[dissimilarity-measure expression combining the required resource-block count and the activation states]

where ⌈·⌉ denotes the ceiling operation and N_c is a resource-block loading threshold that prevents a user with low spectral efficiency from occupying too many spectrum resource blocks. The terrestrial communication rate, which represents the transmission rate of terrestrial communication between two users, determines the required resource-block count and is computed from:

[terrestrial communication rate, channel gain and channel loss expressions]

where these expressions involve the channel gain and channel loss of terrestrial communication between the two users, the terrestrial communication bandwidth B_ground, the terrestrial center frequency, the distance between the cluster-center user and the other users within the cluster, and the additional spatial propagation loss η_NLoS of the NLoS link, which can be treated as a constant in the calculation.
Step 220: and constructing an initial adjacent clustering identification matrix based on the distance between the unmanned aerial vehicle base station and the user.
In this step, for each drone base station, the distributed k-sum user clustering algorithm only needs to obtain the adjacent clustering identification matrix Y_p of the nearest O_m users. Therefore, the nearest O_m users are first determined based on the distances between the drone base station and the users. Then, y_{n,0} = 1 indicates that user n is not in any user cluster currently served by the drone base station, y_{n,k>0} = 1 indicates that user n is in the k-th user cluster served by the current drone base station, and each user belongs to exactly one such case. Meanwhile, to guarantee the balance of the clustering result, the adjacent clustering identification matrix Y_p should satisfy:

[balance constraint on Y_p]

The elements of the adjacent clustering identification matrix can be defined as follows:

[initial definition of the elements of Y_p]
the following operations are performed for each iteration of the distributed clustering k-sum algorithm:
step 230: optimizing the initial identification matrix based on the clustering core matrix to obtain an optimized adjacent clustering identification matrix; assigning a value of the optimized neighbor cluster identity matrix to the initial neighbor cluster identity matrix in response to determining that the initial neighbor cluster identity matrix and the optimized neighbor cluster identity matrix are not equal.
Step 240: and ending the iteration process until the initial adjacent clustering identification matrix is equal to the optimized adjacent clustering identification matrix to obtain a plurality of user clusters.
Step 250: based on preset selection conditions, selecting a central user which establishes communication connection with the unmanned aerial vehicle base station from each user cluster; and the clustering result comprises all the user clusters and the central user.
As shown in fig. 4, the row-vector iterative optimization of the k-sums algorithm sequentially optimizes the local clustering identification row vector y_n = [y_{n,0}, y_{n,1}, ..., y_{n,K}] of each user according to:

[row-vector update rule for y_n]

where the local clustering identification matrix before this round of optimization of the row vector y_n is kept unchanged, and g_n denotes the corresponding column vector of the adjacent clustering kernel matrix G_p. The iterative optimization is repeated until the initial adjacent clustering identification matrix equals the optimized adjacent clustering identification matrix, at which point the iteration ends. Based on the row-vector iterative optimization result Y_p, all users with y_{n,k} = 1 are screened out as the k-th user cluster served by the current drone base station. If the preset selection condition is to select the user with the minimum dissimilarity measure as the center user of the user cluster, that user is computed by:

[argmin expression selecting the user with the minimum total dissimilarity within the cluster]
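A heavily simplified sketch of the per-drone clustering of steps 210 to 250 follows. It assumes the clustering kernel matrix has already been built from the dissimilarity measure of step 210, uses a crude never-empty-a-cluster rule in place of the balance constraint (whose exact form appears only as an image above), and selects each cluster center as the member with the minimum total dissimilarity.

    import numpy as np

    def k_sums_cluster(kernel, K, max_iter=100):
        """Distributed k-sums user clustering for one drone base station (sketch).

        kernel : (O_m, O_m) pairwise dissimilarity (clustering kernel) matrix
        K      : number of user clusters served by this drone base station
        Returns the cluster label of every observed user and each cluster-center user.
        """
        n = kernel.shape[0]
        labels = np.arange(n) % K                    # initial adjacent cluster assignment
        for _ in range(max_iter):
            changed = False
            for i in range(n):                       # optimize one row vector y_n at a time
                costs = [kernel[i, labels == k].sum() for k in range(K)]
                best = int(np.argmin(costs))
                # crude stand-in for the balance constraint: never empty a cluster
                if best != labels[i] and np.sum(labels == labels[i]) > 1:
                    labels[i], changed = best, True
            if not changed:                          # identification matrix reached a fixed point
                break
        centers = []
        for k in range(K):                           # center: member with minimum total dissimilarity
            members = np.flatnonzero(labels == k)
            centers.append(int(members[np.argmin(kernel[np.ix_(members, members)].sum(axis=1))]))
        return labels, centers

    # Toy example: 6 observable users, 2 clusters, symmetric random dissimilarities
    rng = np.random.default_rng(1)
    d = rng.random((6, 6))
    d = (d + d.T) / 2
    np.fill_diagonal(d, 0.0)
    print(k_sums_cluster(d, K=2))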
in some embodiments, as shown in fig. 5, while controlling the flight trajectory of the drone base station, the method further includes:
step 700: and constructing a sample of the unmanned aerial vehicle base station at the current moment.
Constructing the sample of the drone base station at the current moment, as shown in fig. 6, specifically includes:
step 710: and obtaining the reward of the communication performance of the unmanned aerial vehicle base station at the current moment from the communication network.
In this step, the reward function is designed to minimize the communication interruption probability of the emergency communication network, and may, for example, assign penalty values when communication interruption occurs at the drone base station or at its adjacent drone base stations:

[reward expression built from the communication interruption probabilities]

where P_outage,-m(t) is the communication interruption probability of drone base station m serving its users:

[communication interruption probability expression]

and the air-to-ground communication rate, similar to the terrestrial communication rate in step 210, can be calculated as follows:

[air-to-ground communication rate, channel gain and channel loss expressions]
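Since the reward and interruption-probability expressions appear only as images above, the sketch below captures just the structure described in the text: the reward penalizes the communication interruption probability of the drone base station's own served users together with a term shared with adjacent drone base stations. The equal weighting and the penalty coefficient are assumptions.

    def local_reward(own_outage_prob, neighbor_outage_probs, penalty=1.0):
        """Assumed reward shape: lower interruption probabilities yield a higher reward."""
        reward = -penalty * own_outage_prob
        if neighbor_outage_probs:
            reward -= penalty * sum(neighbor_outage_probs) / len(neighbor_outage_probs)
        return reward

    print(local_reward(0.05, [0.10, 0.02]))   # -0.11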
step 720: transmitting the reward and the regulatory action to a plurality of neighboring drone base stations and receiving a plurality of neighboring regulatory actions and neighboring rewards transmitted by the plurality of neighboring drone base stations.
In this step, adjacent drone base stations share their rewards and regulation actions, so that information is exchanged and complete training data is provided for subsequent model training.
Step 730: and calculating to obtain a subsequent state at the next moment by using a state transition distribution function based on the regulation action and the current state.
In this step, according to the action selection policy π(a_t|s_t), the regulation action a_t is output; executing the regulation action a_t yields the reward r_t fed back by the environment interaction, and the state transitions to the next-moment state s_{t+1} through the state transition distribution p_π(s_{t+1}|s_t, a_t).
Step 740: and combining the current state, the regulation action, the reward, the subsequent state and the adjacent regulation action to obtain a sample.
In this step, the current state, the regulation action, the reward, the subsequent state, and the adjacent regulation actions are combined into one set to obtain the sample of the drone base station at the current moment.
Step 800: sending the sample to a pre-constructed experience playback pool; wherein the experience replay pool is used to train the MASAC neural network.
In the step, each unmanned aerial vehicle base station sends the sample of the current moment to a constructed experience playback pool, and provides a data sample set for model training.
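The sample assembled in steps 710 to 740 and stored in step 800 can be represented as a simple named tuple, as in the sketch below; the field layout and the pool capacity are assumptions consistent with the quantities named above.

    from collections import deque, namedtuple

    Sample = namedtuple("Sample", ["state", "action", "reward", "next_state", "neighbor_actions"])

    replay_pool = deque(maxlen=100_000)    # the pre-constructed experience replay pool

    def store_sample(state, action, reward, next_state, neighbor_actions):
        """Combine the current state, regulation action, reward, subsequent state and the
        regulation actions shared by adjacent drone base stations into one sample (step 740),
        then send it to the experience replay pool (step 800)."""
        replay_pool.append(Sample(state, action, reward, next_state, list(neighbor_actions)))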
In some embodiments, distributed training means that an agent is deployed on each drone base station and is used to train a MASAC neural network, where the MASAC neural network comprises a policy-function Actor neural network and a double action-value-function Double Q neural network, and the Actor neural network is used to receive the current state and output the regulation action;
the training process of any MASAC neural network, as shown in fig. 7, includes:
step 001: a sample is taken from the empirical recovery pool.
Step 002: extracting the current state, the regulation action, the reward, the subsequent state and the adjacent regulation action in the sample.
Step 003: inputting the subsequent state into a TargetActor neural network to obtain the target action at the next moment; wherein, the TargetActor neural network is a copy network of the Actor neural network.
Step 004: and sending the target action to a plurality of adjacent unmanned aerial vehicle base stations and receiving a plurality of adjacent target actions sent by the plurality of adjacent unmanned aerial vehicle base stations.
Step 005: a temporal-difference error is calculated based on the current state, the regulation action, the reward, the successor state, the adjacent regulation actions and the adjacent target actions.
Step 006: the double action-value-function Double Q neural network is updated based on the temporal-difference error to obtain the state-action value function.
Step 007: the Actor neural network and the Target neural networks are updated based on the state-action value function, finishing the training of the MASAC neural network; the Target neural networks are copies of the Actor neural network and the Double Q neural network, and include the TargetActor neural network.
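The sketch below illustrates one possible shape of a training iteration following steps 001 to 007, using PyTorch. The agent attributes (actor, target_actor, critic1, critic2 and their Target copies, and the optimizers), the exchange_target_actions helper, and the values of the entropy weight alpha and discount gamma are assumptions standing in for the sub-networks and inter-drone message exchange described below, not parameters fixed by the present application.

    import torch
    import torch.nn.functional as F

    def train_step(agent, batch, gamma=0.99, alpha=0.2):
        """One MASAC training iteration for a single agent (structural sketch)."""
        s, a, r, s_next, a_neighbors = batch                          # steps 001-002: sample fields

        with torch.no_grad():
            a_target, logp = agent.target_actor.sample(s_next)        # step 003: next target action
            a_nb_target = agent.exchange_target_actions(a_target)     # step 004: neighbor exchange
            q_next = torch.min(
                agent.target_critic1(s_next, a_target, a_nb_target),
                agent.target_critic2(s_next, a_target, a_nb_target),
            )
            td_target = r + gamma * (q_next - alpha * logp)           # step 005: TD target

        # step 006: update the Double Q networks from the temporal-difference error
        critic_loss = (F.mse_loss(agent.critic1(s, a, a_neighbors), td_target)
                       + F.mse_loss(agent.critic2(s, a, a_neighbors), td_target))
        agent.critic_optim.zero_grad()
        critic_loss.backward()
        agent.critic_optim.step()

        # step 007: update the Actor against the learned state-action value function
        a_new, logp_new = agent.actor.rsample(s)                      # reparameterized action
        q_new = torch.min(agent.critic1(s, a_new, a_neighbors),
                          agent.critic2(s, a_new, a_neighbors))
        actor_loss = (alpha * logp_new - q_new).mean()
        agent.actor_optim.zero_grad()
        actor_loss.backward()
        agent.actor_optim.step()

        agent.soft_update_targets()                                   # slowly track the Target copies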
In the distributed training, W groups of independent sample sets D_1, D_2, ..., D_W are randomly taken out of the experience replay pool and used to train all the multi-agent maximum entropy reinforcement learning MASAC neural networks in the set W respectively. As shown in fig. 8, each MASAC neural network is composed of 6 sub-neural-networks. The Actor neural network represents the action selection policy: with its own neural network parameters, it takes the local observation state as input and outputs the mean and standard deviation of the action distribution under that observation state to represent the action selection policy. After training, this network is used in step 600 to output the flight speeds in the three directions for the distributed trajectory optimization of the drone base station. To optimize the Actor neural network, the other MASAC sub-networks are set up as follows. The Double Q neural network consists of the Critic1 neural network and the Critic2 neural network, which fit two approximate state-action value functions with their respective neural network parameters; fitting two state-action value functions mitigates the overestimation of the state-action value function caused by a single Critic network. The Target network consists of three neural networks, the TargetActor neural network, the Target Critic1 neural network and the Target Critic2 neural network, each with its own neural network parameters. These three Target networks are copies of the Actor network, the Critic1 network and the Critic2 network respectively, but their parameters are updated more slowly, which improves the stability of the training process and accelerates the convergence of the algorithm.
Specifically, the action selection policy aims to maximize the state-action value function, so the optimization objective of the Actor network can be expressed as:

[Actor optimization objective expression]

Because the output of the Actor network is a distribution rather than a specific action value, and a numerical representation of the output action is required when computing the gradient of the optimization objective, the reparameterization trick is adopted to output an estimated action:

[reparameterized action expression]

where ε_t is a Gaussian noise vector with mean 0 that is independent of the action output policy. The Critic networks aim to fit the state-action value function, so their optimization objective can be expressed by the temporal-difference error:

[Critic optimization objective expressed as a temporal-difference error]

Combining the above optimization objectives, the network parameters are updated as follows:

[gradient update rules for the Actor, Critic1 and Critic2 parameters]

where η is the neural network update step size.
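The two ingredients referred to above, the reparameterized action output and the slow update of the Target networks, can be sketched as follows. The tanh squashing, the Gaussian form and the update rate tau are standard soft-actor-critic choices used here as assumptions, since the exact expressions above are reproduced only as images.

    import torch

    def reparameterized_action(actor, obs):
        """a_t = squash(mu(obs) + sigma(obs) * eps), with eps ~ N(0, I) independent of the policy."""
        mu, log_std = actor(obs)
        eps = torch.randn_like(mu)
        return torch.tanh(mu + log_std.exp() * eps)   # bounded velocity command

    @torch.no_grad()
    def soft_update(target_net, online_net, tau=0.005):
        """Move each Target parameter a small step toward the corresponding online parameter."""
        for p_t, p in zip(target_net.parameters(), online_net.parameters()):
            p_t.mul_(1.0 - tau).add_(tau * p)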
The agent alternates between the distributed execution process and the distributed training process, i.e., it controls the flight trajectory of the drone base station and trains and optimizes the MASAC neural network in alternating iterations. New samples are obtained from the communication network environment and stored in the experience replay pool, and batches of samples are randomly taken from the experience replay pool to train the neural network parameters, so that the agent learns the optimal action output policy and obtains the optimal flight trajectory of the drone base station, guaranteeing communication coverage of the preset area while avoiding communication interruption between the drone base stations.
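Using the hypothetical helpers from the earlier sketches (execution_step, store_sample, replay_pool and train_step) plus an assumed exchange_feedback method for collecting the reward, successor state and neighbor actions, the alternation described here reduces to the following loop.

    import random

    def sample_batch(pool, batch_size):
        """Draw a random mini-batch from the experience replay pool."""
        return random.sample(list(pool), batch_size)

    def run_agent(uav, agent, ensemble, env, steps=10_000, batch_size=256):
        """Alternate the distributed execution and distributed training processes (sketch)."""
        for _ in range(steps):
            state, action = execution_step(uav, ensemble, env)               # act in the environment
            reward, next_state, neighbor_actions = uav.exchange_feedback(env)
            store_sample(state, action, reward, next_state, neighbor_actions)
            if len(replay_pool) >= batch_size:
                batch = sample_batch(replay_pool, batch_size)    # collation into tensors omitted here
                train_step(agent, batch)                         # optimize the MASAC networks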
In some embodiments, as shown in fig. 9, after sending the sample to the experience playback pool, the method further includes:
step 900: the cumulative reward of the target MASAC neural network is updated.
In this step, while distributed execution is performed to control the flight trajectory of the drone base station, in order to stabilize the convergence performance of the multi-agent reinforcement learning algorithm, the embodiment of the present application incorporates an ensemble learning technique, with the specific process shown in fig. 10. The agent carried by each drone base station trains W groups of neural networks simultaneously, forming the ensemble-learning neural network set W. In the distributed execution stage, the agent randomly samples a group of target neural networks w from the neural network set W, inputs the characterized network state into the Actor neural network module of w, and decides the action of the drone base station. The drone base station executes the action and obtains the reward r_m from the network environment, shares the regulation action and reward information with adjacent drone base stations with the assistance of communication overhead, and stores all samples at the current moment into the experience replay pool for the distributed model training process. It then updates the cumulative reward of the target neural network w:

[cumulative-reward update expression for the target network w]

where τ_w is the update step size of the neural network's cumulative reward.
Step 1000: the maximum cumulative reward in the set of neural networks is updated.
In this step, the maximum cumulative reward in the neural network set W is updated:

[maximum cumulative-reward update expression]
Step 1100: in response to determining that the cumulative reward is less than the maximum cumulative reward, pruning the target MASAC neural network and selecting a new target MASAC neural network from the set of neural networks.
The method specifically comprises: determining the MASAC neural network with the maximum cumulative reward value in the neural network set excluding the target MASAC neural network, and replicating that MASAC neural network as the new target MASAC neural network. That is, if the cumulative reward of the target neural network w is significantly less than the maximum cumulative reward of the neural network set, the neural network w is pruned, and the neural network with the largest cumulative reward value among the remaining neural networks in the set W is copied as the new target neural network w.
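A minimal sketch of the ensemble bookkeeping in steps 900 to 1100 follows; the moving-average form of the cumulative-reward update and the pruning margin are assumptions, since the corresponding expressions appear only as images above.

    import copy

    def update_ensemble(ensemble, cumulative_rewards, target_idx, new_reward,
                        tau_w=0.05, margin=0.2):
        """Bookkeeping for the set of MASAC networks (steps 900-1100, sketch).

        ensemble           : list of the W trained MASAC networks
        cumulative_rewards : running cumulative-reward estimate of each network
        target_idx         : index of the currently selected target MASAC network w
        new_reward         : reward just obtained while executing network w
        """
        # step 900: update the cumulative reward of the target network (assumed moving average)
        cumulative_rewards[target_idx] = ((1 - tau_w) * cumulative_rewards[target_idx]
                                          + tau_w * new_reward)
        # step 1000: refresh the maximum cumulative reward in the neural network set
        best_idx = max(range(len(ensemble)), key=lambda i: cumulative_rewards[i])
        # step 1100: prune a clearly lagging target network and copy the best remaining one
        if cumulative_rewards[target_idx] < cumulative_rewards[best_idx] - margin:
            ensemble[target_idx] = copy.deepcopy(ensemble[best_idx])
            cumulative_rewards[target_idx] = cumulative_rewards[best_idx]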
It should be noted that the method of the embodiment of the present application may be executed by a single device, such as a computer or a server. The method of the embodiment can also be applied to a distributed scene and completed by the mutual cooperation of a plurality of devices. In this distributed scenario, one device of the multiple devices may only perform one or more steps of the method of the embodiment of the present application, and the multiple devices interact with each other to complete the method.
It should be noted that the above describes some embodiments of the present application. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments described above and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
Based on the same inventive concept, corresponding to the method of any embodiment, the application also provides a communication coverage device based on multi-agent maximum entropy reinforcement learning.
Referring to fig. 11, the multi-agent maximum entropy reinforcement learning-based communication coverage device includes:
an information acquisition module 10 configured to: and acquiring local observation information at the current moment.
A user clustering module 20 configured to: based on the local observation information, clustering the users in the preset area at the current moment by using a distributed clustering k-sum algorithm to obtain a clustering result.
A feature transformation module 30 configured to: and characterizing the local observation information and the clustering result into a current state.
A model selection module 40 configured to: and selecting a target multi-agent maximum entropy reinforcement learning MASAC neural network from the trained neural network set.
An action acquisition module 50 configured to: and inputting the current state into a target MASAC neural network to obtain a regulation action.
An action execution module 60 configured to: and controlling the flight track of the unmanned aerial vehicle base station based on the regulation and control action.
For convenience of description, the above devices are described as being divided into various modules by functions, and are described separately. Of course, the functionality of the various modules may be implemented in the same one or more software and/or hardware implementations as the present application.
The apparatus of the foregoing embodiment is used to implement the corresponding communication coverage method based on multi-agent maximum entropy reinforcement learning in any of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiment, which are not described herein again.
Based on the same inventive concept, corresponding to any of the above-mentioned embodiments, the present application further provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and running on the processor, and when the processor executes the program, the multi-agent maximum entropy reinforcement learning-based communication coverage method according to any of the above-mentioned embodiments is implemented.
Fig. 12 is a schematic diagram illustrating a more specific hardware structure of an electronic device according to this embodiment, where the electronic device may include: a processor 1010, a memory 1020, an input/output interface 1030, a communication interface 1040, and a bus 1050. Wherein the processor 1010, memory 1020, input/output interface 1030, and communication interface 1040 are communicatively coupled to each other within the device via bus 1050.
The processor 1010 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more Integrated circuits, and is configured to execute related programs to implement the technical solutions provided in the embodiments of the present disclosure.
The Memory 1020 may be implemented in the form of a ROM (Read Only Memory), a RAM (random access Memory), a static storage device, a dynamic storage device, or the like. The memory 1020 may store an operating system and other application programs, and when the technical solution provided by the embodiments of the present specification is implemented by software or firmware, the relevant program codes are stored in the memory 1020 and called to be executed by the processor 1010.
The input/output interface 1030 is used for connecting an input/output module to input and output information. The i/o module may be configured as a component in a device (not shown) or may be external to the device to provide a corresponding function. The input devices may include a keyboard, a mouse, a touch screen, a microphone, various sensors, etc., and the output devices may include a display, a speaker, a vibrator, an indicator light, etc.
The communication interface 1040 is used for connecting a communication module (not shown in the drawings) to implement communication interaction between the present apparatus and other apparatuses. The communication module can realize communication in a wired mode (such as USB, network cable and the like) and also can realize communication in a wireless mode (such as mobile network, WIFI, bluetooth and the like).
The bus 1050 includes a path to transfer information between various components of the device, such as the processor 1010, memory 1020, input/output interface 1030, and communication interface 1040.
It should be noted that although the above-mentioned device only shows the processor 1010, the memory 1020, the input/output interface 1030, the communication interface 1040 and the bus 1050, in a specific implementation, the device may also include other components necessary for normal operation. In addition, those skilled in the art will appreciate that the above-described apparatus may also include only those components necessary to implement the embodiments of the present description, and not necessarily all of the components shown in the figures.
The electronic device of the foregoing embodiment is used to implement the corresponding communication coverage method based on multi-agent maximum entropy reinforcement learning in any of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiment, which are not described herein again.
Based on the same inventive concept, corresponding to any of the above-mentioned embodiment methods, the present application further provides a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform the multi-agent maximum entropy reinforcement learning-based communication overlay method according to any of the above embodiments.
Computer-readable media of the present embodiments, including permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
The storage medium of the above embodiment stores computer instructions for causing the computer to execute the communication coverage method based on multi-agent maximum entropy reinforcement learning as described in any of the above embodiments, and has the beneficial effects of corresponding method embodiments, which are not described herein again.
Those of ordinary skill in the art will understand that the discussion of any embodiment above is merely exemplary and is not intended to imply that the scope of the present disclosure, including the claims, is limited to these examples; within the concept of the present application, features of the above embodiments or of different embodiments may also be combined, steps may be implemented in any order, and many other variations of the different aspects of the embodiments of the present application exist as described above, which are not provided in detail for the sake of brevity.
In addition, well-known power/ground connections to Integrated Circuit (IC) chips and other components may or may not be shown in the provided figures for simplicity of illustration and discussion, and so as not to obscure the embodiments of the application. Furthermore, devices may be shown in block diagram form in order to avoid obscuring embodiments of the application, and this also takes into account the fact that specifics with respect to implementation of such block diagram devices are highly dependent upon the platform within which the embodiments of the application are to be implemented (i.e., specifics should be well within purview of one skilled in the art). Where specific details (e.g., circuits) are set forth in order to describe example embodiments of the application, it should be apparent to one skilled in the art that the embodiments of the application can be practiced without, or with variation of, these specific details. Accordingly, the description is to be regarded as illustrative instead of restrictive.
While the present application has been described in conjunction with specific embodiments thereof, many alternatives, modifications, and variations of these embodiments will be apparent to those of ordinary skill in the art in light of the foregoing description. For example, other memory architectures, such as Dynamic RAM (DRAM), may use the discussed embodiments.
The present embodiments are intended to embrace all such alternatives, modifications, and variations that fall within the broad scope of the appended claims. Therefore, any omissions, modifications, equivalent substitutions, improvements, and the like that may be made without departing from the spirit and principles of the embodiments of the present application are intended to be included within the scope of the present application.

Claims (10)

1. A communication coverage method based on multi-agent maximum entropy reinforcement learning is characterized in that a plurality of unmanned aerial vehicle base stations which establish communication connection in a hybrid networking mode form a communication network capable of covering a preset area, the communication network provides communication service for all users located in the preset area, and for any one unmanned aerial vehicle base station in the communication network, the communication coverage method based on multi-agent maximum entropy reinforcement learning comprises the following steps:
acquiring local observation information at the current moment;
based on the local observation information, clustering the users located in the preset area at the current moment by using a distributed clustering k-sums algorithm to obtain a clustering result;
characterizing the local observation information and the clustering result as a current state;
selecting a multi-agent maximum entropy reinforcement learning MASAC neural network from the trained neural network set as a target MASAC neural network;
inputting the current state into the target MASAC neural network to obtain a regulation action;
and controlling the flight trajectory of the unmanned aerial vehicle base station based on the regulation action.
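For illustration only (this sketch does not form part of the claims), the per-time-step loop recited in claim 1 might look as follows in Python; the uav object, the clustering function, the state builder, and the network-selection function are hypothetical stand-ins supplied by the caller.

```python
# Hedged sketch of the decision loop of claim 1; all helper callables are assumptions.
def coverage_step(uav, trained_networks, cluster_fn, build_state_fn, select_target_fn):
    obs = uav.observe_local()                        # local observation at the current moment
    clusters = cluster_fn(obs)                       # distributed clustering (k-sums) result
    state = build_state_fn(obs, clusters)            # characterize observation + clusters as state
    target_net = select_target_fn(trained_networks)  # target MASAC neural network
    action = target_net.select_action(state)         # regulation action from the Actor
    uav.control_trajectory(action)                   # adjust the UAV base station flight trajectory
    return state, action
```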
2. The method according to claim 1, wherein the clustering, based on the local observation information, the users located in the preset area at the current moment by using the distributed clustering k-sums algorithm to obtain a clustering result specifically comprises:
converting the local observation information into a clustering kernel matrix;
constructing an initial adjacent clustering identification matrix based on the distance between the unmanned aerial vehicle base station and the user;
performing the following for each iteration of the distributed clustering k-sums algorithm:
optimizing the initial adjacent clustering identification matrix based on the clustering kernel matrix to obtain an optimized adjacent clustering identification matrix; and assigning the value of the optimized adjacent clustering identification matrix to the initial adjacent clustering identification matrix in response to determining that the initial adjacent clustering identification matrix and the optimized adjacent clustering identification matrix are not equal;
ending the iteration process when the initial adjacent clustering identification matrix is equal to the optimized adjacent clustering identification matrix, to obtain a plurality of user clusters;
selecting, from each user cluster based on a preset selection condition, a central user that establishes a communication connection with the unmanned aerial vehicle base station; wherein the clustering result includes all of the user clusters and the central users.
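As an illustration of the fixed-point structure of claim 2, and not of the patent's exact distributed k-sums update, the following Python/NumPy sketch iterates a kernel-based reassignment of users until the adjacent clustering identification matrix stops changing; the kernel k-means style distance used here is a stand-in assumption.

```python
import numpy as np

def kernel_cluster_iterate(K, H_init, max_iters=100):
    """Iterate a kernel-based user reassignment until the adjacent clustering
    identification matrix stops changing, mirroring the loop of claim 2.
    K      : (n, n) clustering kernel matrix built from local observations.
    H_init : (n, k) initial 0/1 adjacent clustering identification matrix.
    Note: a standard kernel k-means style update is used here as a stand-in;
    the patent's exact distributed k-sums update is not reproduced."""
    H = H_init.astype(float)
    n, k = H.shape
    for _ in range(max_iters):
        sizes = H.sum(axis=0).clip(min=1.0)                     # users per cluster
        within = np.einsum('ic,ij,jc->c', H, K, H) / sizes**2   # cluster self-similarity
        cross = (K @ H) / sizes                                 # user-to-cluster similarity
        dist = np.diag(K)[:, None] - 2.0 * cross + within[None, :]
        H_new = np.eye(k)[dist.argmin(axis=1)]                  # reassign each user
        if np.array_equal(H_new, H):                            # matrices equal: converged
            break
        H = H_new                                               # keep the optimized matrix
    return H  # one column per user cluster; a central user can then be picked per cluster
```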
3. The method of claim 1, further comprising, after controlling the flight trajectory of the unmanned aerial vehicle base station:
constructing a sample of the unmanned aerial vehicle base station at the current moment;
sending the sample to a pre-constructed experience playback pool; wherein the experience playback pool is used to train the MASAC neural network.
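A minimal sketch of an experience playback pool as used in claim 3 is given below, assuming a simple fixed-capacity buffer with uniform random sampling; both the capacity and the sampling rule are assumptions, not specified by the claim.

```python
import random
from collections import deque

class ExperiencePlaybackPool:
    """Minimal sketch of the experience playback pool of claim 3; capacity
    and uniform random sampling are illustrative assumptions."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, sample):
        # A sample combines (state, regulation action, reward, successor state,
        # neighboring regulation actions), as constructed in claim 4.
        self.buffer.append(sample)

    def sample(self, batch_size=1):
        # Draw samples uniformly at random for training the MASAC neural network.
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))
```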
4. The method according to claim 3, wherein constructing the sample of the unmanned aerial vehicle base station at the current moment specifically comprises:
acquiring, from the communication network, a reward for the communication performance of the unmanned aerial vehicle base station at the current moment;
transmitting the reward and the regulation action to a plurality of neighboring unmanned aerial vehicle base stations, and receiving a plurality of neighboring regulation actions and neighboring rewards transmitted by the plurality of neighboring unmanned aerial vehicle base stations;
calculating a successor state at the next moment by using a state transition distribution function based on the regulation action and the current state;
combining the current state, the regulation action, the reward, the successor state, and the neighboring regulation actions to obtain the sample.
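For illustration only, claim 4's sample construction for one UAV base station could be sketched as below; the uav messaging interface and transition_fn (standing in for the state transition distribution function) are hypothetical names introduced here.

```python
def build_sample(uav, current_state, regulation_action, transition_fn):
    """Illustrative sketch of claim 4's sample construction for one UAV base station;
    the uav interface and transition_fn are hypothetical."""
    reward = uav.get_communication_reward()           # reward at the current moment
    uav.send_to_neighbors(reward, regulation_action)  # share reward and regulation action
    neighbor_actions, neighbor_rewards = uav.receive_from_neighbors()
    # Neighbor rewards may feed other bookkeeping; claim 4's sample does not include them.
    successor_state = transition_fn(current_state, regulation_action)
    return (current_state, regulation_action, reward, successor_state, neighbor_actions)
```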
5. The method of claim 4, wherein one agent is deployed on each unmanned aerial vehicle base station, the agent is configured to train the MASAC neural network, the MASAC neural network comprises a policy function Actor neural network and a double action value function DoubleQ neural network, and the Actor neural network is configured to receive the current state and output the regulation action;
the training process of any MASAC neural network comprises the following steps:
taking one of the samples from the experience playback pool;
extracting the current state, the regulation action, the reward, the successor state, and the neighboring regulation actions in the sample;
inputting the successor state into a Target Actor neural network to obtain a target action at the next moment; wherein the Target Actor neural network is a replica network of the Actor neural network;
sending the target action to the plurality of neighboring unmanned aerial vehicle base stations and receiving a plurality of neighboring target actions sent by the plurality of neighboring unmanned aerial vehicle base stations;
calculating a temporal difference error based on the current state, the regulation action, the reward, the successor state, the neighboring regulation actions, and the neighboring target actions;
updating the double action value function DoubleQ neural network based on the temporal difference error to obtain a state-action value function;
updating the Actor neural network and the Target neural network based on the state-action value function, thereby completing training of the MASAC neural network; wherein the Target neural network is a replica network of the Actor neural network and the DoubleQ neural network, and the Target neural network includes the Target Actor neural network.
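The following Python/PyTorch sketch illustrates one training step in the spirit of claim 5, using a standard soft actor-critic style update with double Q targets and the neighboring actions concatenated into the critic input; the network modules, the batch layout, the single optimizer covering both critics, and the soft-update rule for the Target networks are all assumptions, not the patent's exact procedure.

```python
import torch
import torch.nn.functional as F

def masac_update(actor, q1, q2, target_actor, target_q1, target_q2,
                 batch, q_optim, pi_optim, gamma=0.99, alpha=0.2, tau=0.005):
    # Batch layout (assumption): state, action, reward, successor state,
    # neighboring regulation actions (current step and next step).
    s, a, r, s_next, a_nbr, a_nbr_next = batch

    # Target action at the next moment from the Target Actor (replica) network;
    # the actor is assumed to return (action, log_probability).
    with torch.no_grad():
        a_next, logp_next = target_actor(s_next)
        q_in_next = torch.cat([s_next, a_next, a_nbr_next], dim=-1)
        # Double-Q target with the maximum-entropy correction term.
        q_target = torch.min(target_q1(q_in_next), target_q2(q_in_next))
        y = r + gamma * (q_target - alpha * logp_next)

    # Temporal difference error drives the DoubleQ (critic) update.
    q_in = torch.cat([s, a, a_nbr], dim=-1)
    critic_loss = F.mse_loss(q1(q_in), y) + F.mse_loss(q2(q_in), y)
    q_optim.zero_grad()
    critic_loss.backward()
    q_optim.step()

    # Actor update from the (entropy-regularized) state-action value function.
    a_new, logp = actor(s)
    q_in_new = torch.cat([s, a_new, a_nbr], dim=-1)
    actor_loss = (alpha * logp - torch.min(q1(q_in_new), q2(q_in_new))).mean()
    pi_optim.zero_grad()
    actor_loss.backward()
    pi_optim.step()

    # Soft update of the Target (replica) networks.
    for net, tgt in ((q1, target_q1), (q2, target_q2), (actor, target_actor)):
        for p, tp in zip(net.parameters(), tgt.parameters()):
            tp.data.mul_(1.0 - tau).add_(tau * p.data)
```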
6. The method of claim 4, further comprising, after sending the sample to the experience playback pool:
updating a cumulative reward of the target MASAC neural network;
updating a maximum cumulative reward in the set of neural networks;
pruning the target MASAC neural network and selecting a new target MASAC neural network from the set of neural networks in response to determining that the cumulative reward is less than the maximum cumulative reward.
7. The method according to claim 6, wherein the selecting of a new target MASAC neural network from the set of neural networks specifically comprises:
determining, among the set of neural networks excluding the target MASAC neural network, a MASAC neural network with the greatest cumulative reward value;
replicating the determined MASAC neural network as the new target MASAC neural network.
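Claims 6 and 7 together describe bookkeeping over cumulative rewards and replacement of an underperforming target network; a possible sketch is shown below, where the list-based bookkeeping and deepcopy replication are assumptions introduced for illustration.

```python
import copy

def refresh_target_network(networks, cumulative_rewards, target_idx):
    """Prune the target MASAC network when its cumulative reward falls below the
    maximum in the set, then replicate the best remaining network as the new target.
    Returns the (possibly new) target network and its index."""
    if cumulative_rewards[target_idx] < max(cumulative_rewards):
        networks.pop(target_idx)                  # prune the current target network
        cumulative_rewards.pop(target_idx)
        best = max(range(len(networks)),          # greatest cumulative reward among the rest
                   key=lambda i: cumulative_rewards[i])
        return copy.deepcopy(networks[best]), best    # replicate as the new target
    return networks[target_idx], target_idx
```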
8. A communication coverage apparatus based on multi-agent maximum entropy reinforcement learning, comprising:
An information acquisition module configured to: acquire local observation information at the current moment;
A user clustering module configured to: cluster, based on the local observation information, the users located in the preset area at the current moment by using a distributed clustering k-sums algorithm to obtain a clustering result;
A feature transformation module configured to: characterize the local observation information and the clustering result as a current state;
A model selection module configured to: select a target multi-agent maximum entropy reinforcement learning MASAC neural network from the trained neural network set;
An action acquisition module configured to: input the current state into the target MASAC neural network to obtain a regulation action;
An action execution module configured to: control the flight trajectory of the unmanned aerial vehicle base station based on the regulation action.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method of any one of claims 1 to 7 when executing the program.
10. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1 to 7.
CN202210674727.2A 2022-06-14 2022-06-14 Communication coverage method based on multi-agent maximum entropy reinforcement learning and related equipment Active CN115314904B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210674727.2A CN115314904B (en) 2022-06-14 2022-06-14 Communication coverage method based on multi-agent maximum entropy reinforcement learning and related equipment


Publications (2)

Publication Number Publication Date
CN115314904A true CN115314904A (en) 2022-11-08
CN115314904B CN115314904B (en) 2024-03-29

Family

ID=83855644

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210674727.2A Active CN115314904B (en) 2022-06-14 2022-06-14 Communication coverage method based on multi-agent maximum entropy reinforcement learning and related equipment

Country Status (1)

Country Link
CN (1) CN115314904B (en)



Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102427584A (en) * 2011-12-30 2012-04-25 江南大学 Method for realizing targeted coverage and tracking of multi-agent system (MAS)
CN103747537A (en) * 2014-01-15 2014-04-23 广东交通职业技术学院 Wireless sensor network outlier data self-adaption detecting method based on entropy measurement
US20220019866A1 (en) * 2018-11-30 2022-01-20 Google Llc Controlling robots using entropy constraints
CN111563457A (en) * 2019-12-31 2020-08-21 成都理工大学 Road scene segmentation method for unmanned automobile
US20210266202A1 (en) * 2020-02-25 2021-08-26 Nokia Solutions And Networks Oy Communication-channel tracking aided by reinforcement learning
CN111786713A (en) * 2020-06-04 2020-10-16 大连理工大学 Unmanned aerial vehicle network hovering position optimization method based on multi-agent deep reinforcement learning
CN113346944A (en) * 2021-06-28 2021-09-03 上海交通大学 Time delay minimization calculation task unloading method and system in air-space-ground integrated network
CN113255294A (en) * 2021-07-14 2021-08-13 北京邮电大学 Named entity recognition model training method, recognition method and device
CN114611660A (en) * 2022-01-25 2022-06-10 北京邮电大学 Emergency unmanned aerial vehicle group track regulation and control method and related equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
GU Xin; YU Jiguo; WANG Guanghui: "A Survey of Clustering Routing Protocols for Wireless Sensor Networks", Communications Technology, no. 08 *
LI Yaping: "Distributed flexible resource active power balance dispatching architecture and strategy based on swarm intelligence", Electric Power Automation Equipment, vol. 42, no. 7 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115580901A (en) * 2022-12-08 2023-01-06 深圳市永达电子信息股份有限公司 Communication base station networking method, communication system, electronic equipment and readable storage medium
CN115580901B (en) * 2022-12-08 2023-05-16 深圳市永达电子信息股份有限公司 Communication base station networking method, communication system, electronic equipment and readable storage medium

Also Published As

Publication number Publication date
CN115314904B (en) 2024-03-29

Similar Documents

Publication Publication Date Title
Seid et al. Collaborative computation offloading and resource allocation in multi-UAV-assisted IoT networks: A deep reinforcement learning approach
CN110531617B (en) Multi-unmanned aerial vehicle 3D hovering position joint optimization method and device and unmanned aerial vehicle base station
Binol et al. Time optimal multi-UAV path planning for gathering its data from roadside units
CN112465151A (en) Multi-agent federal cooperation method based on deep reinforcement learning
CN113162679A (en) DDPG algorithm-based IRS (inter-Range instrumentation System) auxiliary unmanned aerial vehicle communication joint optimization method
Wickramasuriya et al. Base station prediction and proactive mobility management in virtual cells using recurrent neural networks
CN111050330B (en) Mobile network self-optimization method, system, terminal and computer readable storage medium
CN113395654A (en) Method for task unloading and resource allocation of multiple unmanned aerial vehicles of edge computing system
CN114611660A (en) Emergency unmanned aerial vehicle group track regulation and control method and related equipment
CN113346944A (en) Time delay minimization calculation task unloading method and system in air-space-ground integrated network
CN113254188B (en) Scheduling optimization method and device, electronic equipment and storage medium
CN113660681B (en) Multi-agent resource optimization method applied to unmanned aerial vehicle cluster auxiliary transmission
Dai et al. Delay-sensitive energy-efficient UAV crowdsensing by deep reinforcement learning
Zhong et al. Deep Q-network based dynamic movement strategy in a UAV-assisted network
Yang et al. Deep reinforcement learning based wireless network optimization: A comparative study
CN115314904B (en) Communication coverage method based on multi-agent maximum entropy reinforcement learning and related equipment
Bayerlein et al. Learning to rest: A Q-learning approach to flying base station trajectory design with landing spots
CN113852994A (en) High-altitude base station cluster auxiliary edge calculation method used in emergency communication
CN115499921A (en) Three-dimensional trajectory design and resource scheduling optimization method for complex unmanned aerial vehicle network
CN115454527A (en) Flight control and calculation unloading method and system for multi-unmanned aerial vehicle mobile edge calculation
Ebrahim et al. A deep learning approach for task offloading in multi-UAV aided mobile edge computing
CN114339842B (en) Method and device for designing dynamic trajectory of unmanned aerial vehicle cluster in time-varying scene based on deep reinforcement learning
Nasr-Azadani et al. Single-and multiagent actor–critic for initial UAV’s deployment and 3-D trajectory design
CN115278693A (en) CVN (continuously variable transmission) spectrum scheduling method and system based on driving state priority and scene simulation
CN115278698A (en) Unmanned aerial vehicle base station dynamic deployment method and device based on dynamic user distribution prediction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant